Abstract

We consider a supercritical branching population, where individuals have i.i.d. lifetime durations
(which are not necessarily exponentially distributed) and give birth (singly) at constant rate.
We assume that individuals independently experience neutral mutations, at constant rate $\theta$
during their lifetimes, under the infinite-alleles assumption: each mutation instantaneously confers
a brand new type, called allele or haplotype, to its carrier. The type carried by a mother at
the time when she gives birth is transmitted to the newborn.

We are interested in the sizes and ages at time $t$ of the clonal families carrying the most
abundant alleles or the oldest ones, as $t \to \infty$, on the survival event. Intuitively, the results must
depend on how the mutation rate $\theta$ and the Malthusian parameter $\alpha > 0$ compare. Hereafter,
$N = N_t$ is the population size at time $t$, $a$ and $c$ are scaling constants, whereas $k, k'$ are
explicit positive constants which depend on the parameters of the model.

When $\alpha > \theta$, the most abundant families are also the oldest ones; they have size $cN^{1-\theta/\alpha}$ and
age $t - a$.

When $\alpha < \theta$, the oldest families have age $(\alpha/\theta)t + a$ and tight sizes; the most abundant families
have sizes $k\log(N) - k'\log\log(N) + c$ and all have age $(\theta - \alpha)^{-1}\log(t)$.

When $\alpha = \theta$, the oldest families have age $kt - k'\log(t) + a$ and tight sizes; the most abundant
families have sizes $(k\log(N) - k'\log\log(N) + c)^2$ and all have age $t/2$.

Those informal results can be stated rigorously in expectation. Relying heavily on the theory
of coalescent point processes [13, 16], we are also able, when $\alpha \le \theta$, to show convergence in
distribution of the joint, properly scaled ages and sizes of the most abundant/oldest families and
to specify the limits as some explicit Cox processes.

This is in deep contrast with the largest/oldest families in the standard coalescent with Pois- 
sonian mutations, which converge to some point processes after being rescaled by N 0111 E|. 
1 Introduction and motivation

We consider a general branching population, where individuals reproduce independently of each 
other, have i.i.d. lifetime durations with arbitrary distribution, and give birth at constant rate during 
their lifetime. We also assume that each birth gives rise to a single newborn. The genealogical tree 
associated with this construction is known as a splitting tree [H [13]. The process $(N_t; t \ge 0)$
counting the population size is a non-Markovian birth-death process belonging to the class of general
branching processes, or Crump-Mode-Jagers (CMJ) processes. Since births arrive singly and at
constant rate, these processes are sometimes called homogeneous, binary CMJ processes.

Also, individuals are given a type, called allele or haplotype. They inherit their type at birth 
from their mother, and (their germ line) can change type throughout their lifetime, at the points 
of independent Poisson point processes with rate $\theta$, conditional on lifetimes (neutral mutations).
The type conferred by a mutation is each time an entirely new type, an assumption known as the 
infinitely-many alleles model. 

We are interested in the so-called allelic partition (partition into types) of the population alive 
at time $t$. In [2, 12], we obtained explicit formulae for the expected frequency spectrum of the allelic
partition. The frequency spectrum is a convenient way of describing this partition without labelling
types. It is defined as the sequence of numbers $(A_\theta(k,t), k \ge 1)$, where $A_\theta(k,t)$ is the number of
types carried by $k$ individuals at time $t$. For example, in [2] we derived explicit formulae for the
expectation of $A_\theta(k,t)$ conditional on the population size at time $t$. From these formulae, using the theory of
branching processes counted by random characteristics, we specified the a.s. limit, as $t \to \infty$,
of the fraction of types carried by a fixed number $k$ of individuals.

If we call clonal families, or simply families, the equivalence classes associated to identity by type 
(i.e., the components of the allelic partition), it is usual to call small families the families of sizes k = 
1,2,3,... Here, large families will refer to families with most frequent (i.e., abundant) types having 
alive representatives, and old families to families having oldest types with alive representatives, 
where the age of a type is the time elapsed since its original mutation. In the present work, we are 
interested in the asymptotic behavior, as $t \to \infty$, of the sizes and ages of large and of old families.

The most celebrated mathematical result regarding allelic partitions is Ewens' sampling formula, 
which provides the distribution of the frequency spectrum for the Kingman coalescent tree with 
neutral Poissonian mutations [5]. It has notably been shown [31 E] that under this model, the largest 
(resp. oldest) families converge, after being rescaled by the population size $N$, to the Poisson-Dirichlet
(resp. GEM) distribution. We will see here that, for example, the largest families are
never of the order of $N$ but, depending on how the Malthusian parameter $\alpha$ compares with the mutation
rate $\theta$, of the order of $N^{1-\theta/\alpha}$ (case $\theta < \alpha$), of $(\log N)^2$ (case $\theta = \alpha$), or of $\log N$
(case $\theta > \alpha$). The first case ($\theta < \alpha$) shows more similarities with the frequency spectrum of the
Beta-coalescent pQ.
We refer the reader to [2] for more references on the topic of allelic partitions, especially regarding
branching processes. For example, similar questions were studied for general CMJ processes, when
mutations occur at birth, in the monograph by Z. Taïb [18]. These results rely heavily on the
theory of branching processes counted by random characteristics, due to P. Jagers and O. Nerman
[8, 9, 10, 15], and more specifically on time-dependent random characteristics as developed in [10]. Z.
Taïb obtains results of convergence in distribution of the (correctly rescaled) point process of ages,
similar to the results we obtain in Sections 4 and 5. However, the techniques of [18] do not apply to
the case where mutations occur during individuals' lifetimes, since the genealogical tree of types is
not the one of a CMJ process in this case.



One of the initial motivations of [2] and of the present work was the following model inspired 
from the works of P.C. Sabeti and her coauthors (see e.g. p2]). Inside a large population, consider a 
subpopulation consisting of individuals carrying a specific selective gene called 'core haplotype' and 
thus experiencing demographic growth. The haplotypic structure of the subpopulation restricted to 
a portion of length L around the core haplotype on the chromosome carrying it, is assumed to be 
altered by recombination. As long as the total population is sufficiently large w.r.t. the growing 
subpopulation, each time a sequence belonging to (an individual in) this subpopulation recombines 
with another sequence, with high probability this sequence will be a new sequence belonging to the 
rest of the population. Therefore, the new sequence obtained after recombination can be treated 
as a mutant under the infinitely-many alleles model. In this setting, the mutation rate is an increasing
function of $L$. In [17], a tree representation of the allelic partition as a function of $L$ is given for
each core haplotype in a given set of genes suspected to have been selected in humans. The tree
obtained this way is called a "recombination tree". An interesting question is to develop statistical
methods to detect positive selection from the knowledge of this tree. Here, we assume that
our (sub)population grows at a Malthusian rate $\alpha$ (supercritical CMJ process). We are able to give
the asymptotic distribution of the rightmost part of the frequency spectrum for a given mutation
rate (this corresponds to fixing $L$ in the recombination setting). Since $\theta$ can be seen as a death rate
when restricting the count to individuals carrying the same allele, the phase transition at $\theta = \alpha$
is intuitive. In the recombination tree, this phase transition should translate into a transition, at
a certain locus length $L_0$, from a small number of thick branches ($L < L_0$) to a large number of
thin branches ($L > L_0$). We plan to extend this study to a full description of the structure of the
recombination tree. 

In the next section, we define the model rigorously and recall some results from [2]. Section 3
is concerned with the asymptotic behavior, as $t \to \infty$, of the expected sizes and ages of the most
abundant/oldest families. Sections 4 and 5 deal with the joint convergence in distribution of these
sizes and ages in the respective cases when clonal families are subcritical or critical. A final appendix
is devoted to some technical lemmas for the control of moments of order 2 of largest sizes and ages.
2 Model, preliminary results and statement of the main results
2.1 Model 

In this work, we consider genealogical trees satisfying the branching property and called splitting
trees [HE]. Splitting trees are those random trees where individuals' lifetime durations are i.i.d. with
an arbitrary distribution, but where birth events occur at Poisson times during each individual's
lifetime. We call $b$ this constant birth rate and we denote by $V$ a r.v. distributed as the lifetime
duration. Then set $\Lambda(dr) := b\,\mathbb{P}(V \in dr)$, a finite measure on $(0, \infty]$ with total mass $b$ called the
lifespan measure. We will always assume that a splitting tree is started with one unique progenitor
born at time 0.

The process $(N_t, t \ge 0)$ counting the number of alive individuals at time $t$ is a homogeneous,
binary Crump-Mode-Jagers process, which is not Markovian unless $\Lambda$ has an exponential density or
is a Dirac mass at $\{+\infty\}$.
Figure 1: A coalescent point process.

In [13], it is shown that the genealogy of a splitting tree conditioned to be extant at a fixed time $t$
is given by a coalescent point process, that is, a sequence of i.i.d. random variables $H_i$, $i \ge 1$, killed at
its first value greater than $t$. In particular, conditional on $N_t \ne 0$, $N_t$ follows a geometric distribution
with parameter $\mathbb{P}(H < t)$. More specifically, for any $0 \le i < j \le N_t - 1$, the coalescence time between the
$i$-th individual alive at time $t$ and the $j$-th individual alive at time $t$ (i.e., the time elapsed since the
common lineage to both individuals split into two distinct lineages) is the maximum of $H_{i+1}, \ldots, H_j$.
The graphical representation in Figure 1 is straightforward. The common law of these so-called
branch lengths is given by

$$\mathbb{P}(H > s) = \frac{1}{W(s)}, \qquad (2.1)$$

where the nondecreasing function $W$ is such that $W(0) = 1$ and is characterized by its Laplace
transform. More specifically, these branch lengths are the depths of the excursions of the jump
contour process, say $X^{(t)}$, of the splitting tree truncated below level $t$. They are i.i.d. because $X^{(t)}$
is a Markov process. Indeed, it is shown in [13] that $X^{(t)}$ has the law of a Lévy process, say $Y$, with
no negative jumps, reflected below $t$ and killed upon hitting 0. The function $W$ is called the scale
function of $Y$, and is defined from the Laplace exponent $\psi$ of $Y$:



$$\psi(x) = x - \int_{(0,+\infty]} (1 - e^{-rx})\,\Lambda(dr), \qquad x \in \mathbb{R}_+. \qquad (2.2)$$

Let $\alpha$ denote the largest root of $\psi$. In the supercritical case (i.e. $\int_{(0,+\infty]} r\,\Lambda(dr) > 1$), and in this
case only, $\alpha$ is positive and called the Malthusian parameter, because the population size grows
exponentially at rate $\alpha$ on the survival event. Then the function $W$ is characterized by

$$\int_0^\infty e^{-xr}\,W(r)\,dr = \frac{1}{\psi(x)}, \qquad x > \alpha.$$
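As an illustration, consider the Markovian special case where lifetimes are exponential with rate $d$. This case, the parameter names `b` and `d`, and all function names below are assumptions made for this sketch only. Then $\Lambda(dr) = b\,d\,e^{-dr}\,dr$, so that $\psi(x) = x - bx/(x+d)$, the largest root is $\alpha = b - d$, and one checks that $W(x) = (b e^{\alpha x} - d)/\alpha$. The sketch recovers $\alpha$ by bisection and verifies the Laplace transform characterization of $W$ numerically.

```python
import math


def psi(x, b, d):
    """Laplace exponent for Exp(d) lifetimes: the integral in (2.2) equals b*x/(x+d)."""
    return x - b * x / (x + d)


def malthusian_parameter(b, d, tol=1e-12):
    """Largest root of psi, located by bisection; assumes b > d > 0 (supercritical)."""
    lo, hi = 1e-9, b  # psi(lo) < 0 and psi(b) > 0 when b > d
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi(mid, b, d) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def scale_function(x, b, d):
    """Closed form W(x) = (b e^{alpha x} - d) / alpha, valid for Exp(d) lifetimes."""
    alpha = b - d
    return (b * math.exp(alpha * x) - d) / alpha


def laplace_of_W(x, b, d, upper=30.0, n=100_000):
    """Trapezoidal approximation of int_0^infty e^{-x r} W(r) dr, for x > alpha."""
    h = upper / n
    f = lambda r: math.exp(-x * r) * scale_function(r, b, d)
    total = 0.5 * (f(0.0) + f(upper))
    for i in range(1, n):
        total += f(i * h)
    return h * total
```

For instance, with `b = 2.0` and `d = 0.5`, `malthusian_parameter(2.0, 0.5)` agrees with $b - d = 1.5$, and `laplace_of_W(3.0, 2.0, 0.5)` agrees with `1 / psi(3.0, 2.0, 0.5)` up to quadrature error.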

Actually, it is possible to show by path decompositions of the process $Y$ that

$$W(x) = \exp\Big(b \int_0^x dt\,\mathbb{P}(J > t)\Big), \qquad (2.3)$$

where $J$ is the maximum of the path of $Y$ killed upon hitting 0 and started from a random initial
value, distributed as $V$. Note that since $Y$ is also the contour process of a splitting tree, $J$ has the
law of the extinction time of the CMJ process $N$ started from one individual.
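The coalescent point process description can be sampled directly once $W$ is explicit. The sketch below again assumes exponential lifetimes with rate $d$ (so that $W(x) = (b e^{\alpha x} - d)/\alpha$ with $\alpha = b - d$); this special case and all names in the code are ours. Branch lengths are drawn by inverting the tail $\mathbb{P}(H > s) = 1/W(s)$, the population size at time $t$, conditional on survival, is the index of the first branch length exceeding $t$, and coalescence times are maxima of consecutive branch lengths.

```python
import math
import random


def scale_function(x, b, d):
    """W(x) for Exp(d) lifetimes and birth rate b (assumes b > d)."""
    alpha = b - d
    return (b * math.exp(alpha * x) - d) / alpha


def sample_branch_length(b, d, rng):
    """Draw H with P(H > s) = 1 / W(s) by solving W(H) = 1 / U, U uniform on (0, 1]."""
    alpha = b - d
    u = 1.0 - rng.random()
    w = 1.0 / u
    return math.log((alpha * w + d) / b) / alpha


def sample_population_size(t, b, d, rng):
    """N_t conditional on N_t != 0: index of the first branch length exceeding t."""
    n = 1
    while sample_branch_length(b, d, rng) < t:
        n += 1
    return n


def coalescence_time(branches, i, j):
    """Coalescence time of individuals i < j; branches[k - 1] stores H_k."""
    return max(branches[i:j])
```

Under these assumptions the sample mean of `sample_population_size(t, b, d, rng)` should be close to `scale_function(t, b, d)`, since a geometric variable with success probability $1/W(t)$ has mean $W(t)$.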

Throughout this work, we further assume that individuals independently experience mutations at
Poisson times during their lifetime, that each new mutation event confers a brand new type (called
haplotype, or allele) to the individual, and that a newborn holds the same type as her mother at birth
time. The mutation rate is denoted by $\theta$. From now on, $\mathbb{P}_t$ (resp. $\mathbb{P}^\star$) will denote the conditional
probability on survival up to time $t$ (resp. on the survival event).

2.2 Expected frequency spectrum 

Recall from the introduction that $A_\theta(k,t)$ is the number of types carried by $k$ individuals at time $t$.
Also denote by $Z_0(t)$ the number of individuals carrying the ancestral type at time $t$.

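In [2, Cor. 4.3], we obtained the following explicit formulae for the expected frequency spectrum
in the population at time $t$:

$$\mathbb{E}_t A_\theta(k,t) = W(t) \int_0^t dx\,\theta e^{-\theta x}\,\frac{1}{W_\theta(x)^2}\Big(1 - \frac{1}{W_\theta(x)}\Big)^{k-1}, \qquad (2.4)$$

and

$$\mathbb{P}_t(Z_0(t) = k) = W(t)\,\frac{e^{-\theta t}}{W_\theta(t)^2}\Big(1 - \frac{1}{W_\theta(t)}\Big)^{k-1}, \qquad (2.5)$$

where

$$W_\theta(x) := e^{-\theta x}\,W(x) + \theta \int_0^x W(y)\,e^{-\theta y}\,dy, \qquad (2.6)$$

i.e. $W_\theta'(x) = e^{-\theta x}\,W'(x)$ and $W_\theta(0) = W(0) = 1$. Note that $W_\theta$ is the scale function associated
to the clonal splitting tree [2, Thm. 3.1], i.e.

$$\int_0^\infty e^{-xr}\,W_\theta(r)\,dr = \frac{1}{\psi_\theta(x)}, \qquad x > \alpha - \theta,$$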
where

$$\psi_\theta(x) := x - \int_{(0,+\infty]} (1 - e^{-rx})\,b\,\mathbb{P}(V_\theta \in dr) = \frac{x\,\psi(x+\theta)}{x+\theta}, \qquad (2.7)$$

where $V_\theta$ denotes the minimum of $V$ and of an independent exponential r.v. with parameter $\theta$.
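To make the definition of the clonal scale function concrete, here is a small numerical check in the special case of exponential lifetimes with rate $d$ (an assumption of this sketch; the function names are ours). In that case the clonal tree is a birth-death tree with birth rate $b$ and death rate $d + \theta$, so $W_\theta$ has a closed form, and it can be compared with the quadrature of $W_\theta(x) = e^{-\theta x}W(x) + \theta\int_0^x W(y)e^{-\theta y}\,dy$.

```python
import math


def W(x, b, d):
    """Scale function for Exp(d) lifetimes and birth rate b (assumes b != d)."""
    alpha = b - d
    return (b * math.exp(alpha * x) - d) / alpha


def W_theta_numeric(x, b, d, theta, n=20000):
    """Evaluate e^{-theta x} W(x) + theta * int_0^x W(y) e^{-theta y} dy (trapezoidal rule)."""
    h = x / n
    total = 0.5 * (W(0.0, b, d) + W(x, b, d) * math.exp(-theta * x))
    for i in range(1, n):
        y = i * h
        total += W(y, b, d) * math.exp(-theta * y)
    return math.exp(-theta * x) * W(x, b, d) + theta * h * total


def W_theta_closed(x, b, d, theta):
    """Scale function of a birth-death tree with birth rate b and death rate d + theta."""
    r = b - d - theta  # growth rate alpha - theta of clonal families
    if r == 0:
        return 1.0 + b * x  # critical clonal families
    return (b * math.exp(r * x) - (d + theta)) / r
```

With `b = 2.0` and `d = 0.5`, the values `theta = 0.5`, `1.5` and `2.5` give supercritical, critical and subcritical clonal families respectively, and the two functions agree in all three regimes.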

In addition, we were able to obtain the explicit expected age density of the frequency spectrum [2,
Eq. (4.5)]: defining $A_\theta(k,t,y;dy)$ as the number of haplotypes of age in the interval $(y, y + dy)$
represented by exactly $k$ alive individuals at time $t$, we have

$$\mathbb{E}_t A_\theta(k,t,y;dy) = \theta\,dy\,W(t)\,\frac{e^{-\theta y}}{W_\theta(y)^2}\Big(1 - \frac{1}{W_\theta(y)}\Big)^{k-1}. \qquad (2.8)$$

In [2], denoting by $A_\theta(t) = \sum_{k \ge 1} A_\theta(k,t)$ the total number of haplotypes at time $t$, we deduced
from these expressions the a.s. large time convergence of the fraction $A_\theta(k,t)/A_\theta(t)$. Recall that
families with given size $k = 1, 2, 3, \ldots$ are referred to as small families. Large families are those which
have the largest sizes and old families are those whose original mutation is among the oldest. We
are interested in the sizes and ages of large and of old families. For example, the size of the largest
family is the largest $k$ such that $A_\theta(k,t) \ge 1$.
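The expected frequency spectrum can be evaluated numerically. The sketch below takes the formula in the form $\mathbb{E}_t A_\theta(k,t) = W(t)\int_0^t \theta e^{-\theta x}\,W_\theta(x)^{-2}\,(1 - 1/W_\theta(x))^{k-1}\,dx$ and, as an assumption made only for illustration, exponential lifetimes with rate $d$, for which $W$ and $W_\theta$ are explicit. Summing the geometric series in $k$ gives the expected total number of mutant haplotypes, $W(t)\int_0^t \theta e^{-\theta x}/W_\theta(x)\,dx$, which provides a consistency check. All function names are ours.

```python
import math


def W(x, b, d):
    """Scale function for Exp(d) lifetimes and birth rate b (assumes b != d)."""
    alpha = b - d
    return (b * math.exp(alpha * x) - d) / alpha


def W_theta(x, b, d, theta):
    """Clonal scale function: birth rate b, death rate d + theta."""
    r = b - d - theta
    if r == 0.0:
        return 1.0 + b * x
    return (b * math.exp(r * x) - (d + theta)) / r


def trapezoid(f, t, n):
    """Trapezoidal rule for int_0^t f(x) dx with n subintervals."""
    h = t / n
    return h * (0.5 * (f(0.0) + f(t)) + sum(f(i * h) for i in range(1, n)))


def expected_spectrum(k, t, b, d, theta, n=2000):
    """Expected number of mutant types carried by exactly k individuals at time t."""
    def integrand(x):
        w = W_theta(x, b, d, theta)
        return theta * math.exp(-theta * x) * (1.0 - 1.0 / w) ** (k - 1) / w ** 2
    return W(t, b, d) * trapezoid(integrand, t, n)


def expected_total(t, b, d, theta, n=2000):
    """Expected total number of mutant types: the sum over k of expected_spectrum."""
    def integrand(x):
        return theta * math.exp(-theta * x) / W_theta(x, b, d, theta)
    return W(t, b, d) * trapezoid(integrand, t, n)
```

For example, with `t = 5.0`, `b = 2.0`, `d = 1.0` and `theta = 2.0` (subcritical clonal families), summing `expected_spectrum(k, ...)` over `k = 1, ..., 80` reproduces `expected_total(...)` up to a negligible truncation error.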

2.3 Statement of results 

Recall that we always assume that the Malthusian parameter $\alpha$ is positive. The asymptotic size
of the most frequent and oldest haplotypes strongly depends on the way $\alpha$ and the mutation rate
$\theta$ compare. Since $\theta$ is an additional death rate for clonal families, the case $\alpha > \theta$ corresponds to
supercritical clonal families, the case $\theta = \alpha$ corresponds to critical clonal families, and the case $\theta > \alpha$
corresponds to subcritical clonal families.

In the whole paper, we are going to use the following notation: for all $x, s \ge 0$, define

• $L_t(x)$ the number of haplotypes carried by more than (or exactly) $x$ individuals alive at time
$t$ ($L$ for large);

• $O_t(s)$ the number of haplotypes with alive representatives at time $t$ older than $s$, i.e. whose
original mutation has age greater than $s$ ($O$ for old);

• $M_t(x, s)$ the number of haplotypes carried by more than $x$ individuals alive at time $t$ and whose
original mutation has age greater than $s$. By convention, we set $M_t(x,s) = M_t(x,0) = L_t(x)$
when $s < 0$. For $0 \le s_1 < s_2$, we also define

$$M_t(x, s_1, s_2) = M_t(x, s_1) - M_t(x, s_2),$$

the number of haplotypes carried by more than $x$ individuals alive at time $t$, whose original
mutation has age in $(s_1, s_2]$.

Our convergence results are of two kinds: convergence in expectation of $L_t(x_t)$ and $O_t(s_t)$ for
conveniently chosen $x_t$ and $s_t$, which are directly obtained from (2.4), (2.5) and (2.8) (see Section 3),
and convergence in distribution of the point process of (correctly rescaled) largest families or oldest
families, which requires combining the previous equations with the theory of coalescent point processes
[13, 16] (see Sections 4 and 5). We obtain different results depending on whether the clonal
families are supercritical, subcritical or critical.
2.3.1 Supercritical clonal families

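Our only result in the supercritical case ($\alpha > \theta$) is the following (Proposition 3.3): for all
$0 \le a_0 < a_1 < +\infty$ and $c > 0$,

$$\lim_{t\to+\infty} \mathbb{E}_t M_t\big(ce^{(\alpha-\theta)t}, t-a_1, t-a_0\big) = \frac{\alpha-\theta}{\alpha} \int_{a_0}^{a_1} \exp\big(\alpha y - c\,\psi_\theta'(\alpha-\theta)\,e^{(\alpha-\theta)y}\big)\,(\theta\,dy + \delta_0(dy)), \qquad (2.9)$$

where $\delta_0$ denotes the Dirac measure at 0. Note that (2.7) yields

$$\psi_\theta'(\alpha-\theta) = \frac{(\alpha-\theta)\,\psi'(\alpha)}{\alpha}.$$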

This result means that the largest families at time $t$ have a size of the order of $e^{(\alpha-\theta)t}$, and that
their age is of the order of $t$ minus a constant, i.e. they were born in the first moments of the population
growth. In particular, the largest and oldest families are the same.

This result can be interpreted as follows: if $N_\theta(t)$ denotes the size of a clonal family started at
time 0 from one individual, then conditional on its survival at time $t$, $N_\theta(t)\,e^{-(\alpha-\theta)t}$ converges in
distribution to an exponential r.v. with parameter $\psi_\theta'(\alpha-\theta)$ [13]. If we restrict the limit in (2.9)
to its Dirac term, we recover this convergence for the ancestral type:

$$\lim_{t\to+\infty} \mathbb{P}_t\big(Z_0(t) \ge ce^{(\alpha-\theta)t}\big) = \frac{\alpha-\theta}{\alpha}\,e^{-c\,\psi_\theta'(\alpha-\theta)}.$$

The prefactor $(\alpha-\theta)/\alpha$ is the probability of survival of the ancestral family conditional on the survival
of the whole population, which is the ratio of $(\alpha-\theta)/b$ (survival probability of the ancestral family)
to $\alpha/b$ (survival probability of the whole population).

In order to recover $\mathbb{E}_t M_t(ce^{(\alpha-\theta)t}, t-a_1, t-a_0)$, we need to integrate the mutation rate per
branch $\theta\,dy$ against the expected number of individuals alive at time $y$ having at least one alive (not
necessarily clonal) descendant at time $t$, times the probability that the splitting tree spanned by one of
these individuals has at least $ce^{(\alpha-\theta)t}$ clonal descendants. This probability is exactly
$\mathbb{P}_{t-y}(Z_0(t-y) \ge ce^{(\alpha-\theta)t})$, since this splitting tree is not extinct after $t-y$ time units, and it converges to

$$\frac{\alpha-\theta}{\alpha}\,e^{-c\,\psi_\theta'(\alpha-\theta)\,e^{(\alpha-\theta)y}}$$

as $t \to +\infty$. Now, the number of individuals alive at time $y$ having descendants at time $t$ is given
by the number of branches higher than $t-y$ in the coalescent point process, i.e. has a geometric
distribution of parameter $\mathbb{P}(H > t \mid H > t-y) = W(t-y)/W(t)$. As will appear in Lemma 3.1
below, this quantity converges to $e^{-\alpha y}$ as $t \to +\infty$, which completes the interpretation of each term
of (2.9).
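The convergence of the ancestral term can be checked numerically. The sketch below assumes exponential lifetimes with rate $d$ (our assumption, as are the function names), for which $\psi'(\alpha) = \alpha/b$ and hence $\psi_\theta'(\alpha-\theta) = (\alpha-\theta)/b$. It takes the law of the ancestral family in the form $\mathbb{P}_t(Z_0(t) = k) = W(t)\,e^{-\theta t}\,W_\theta(t)^{-2}\,(1 - 1/W_\theta(t))^{k-1}$, so that summing over $k \ge K$ gives $\mathbb{P}_t(Z_0(t) \ge K) = W(t)\,e^{-\theta t}\,W_\theta(t)^{-1}\,(1 - 1/W_\theta(t))^{K-1}$.

```python
import math


def ancestral_tail(c, t, b, d, theta):
    """P_t(Z_0(t) >= ceil(c e^{(alpha - theta) t})) for Exp(d) lifetimes, alpha > theta."""
    alpha = b - d
    r = alpha - theta  # growth rate of clonal families, assumed positive
    W_t = (b * math.exp(alpha * t) - d) / alpha
    W_theta_t = (b * math.exp(r * t) - (d + theta)) / r
    K = math.ceil(c * math.exp(r * t))
    # (1 - 1/W_theta)^(K - 1), computed through log1p for numerical accuracy
    geometric_factor = math.exp((K - 1) * math.log1p(-1.0 / W_theta_t))
    return W_t * math.exp(-theta * t) / W_theta_t * geometric_factor


def ancestral_tail_limit(c, b, d, theta):
    """Limit (alpha - theta)/alpha * exp(-c psi_theta'(alpha - theta))."""
    alpha = b - d
    psi_theta_prime = (alpha - theta) / b  # uses psi'(alpha) = alpha / b
    return (alpha - theta) / alpha * math.exp(-c * psi_theta_prime)
```

With `b = 2.0`, `d = 0.5`, `theta = 0.5` and `c = 1.0`, the value at `t = 20.0` already agrees with the limit to high precision, which reflects the exponentially small corrections in the scale functions.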

Using the theory of time-dependent random characteristics, Z. Taïb [18] was able to obtain
more precise results (but without considering ages of haplotypes) in the case of CMJ processes where
mutations occur at birth. In this case, using the notation $\alpha - \theta$ for the (positive) Malthusian parameter
of clonal families (to be consistent with our notation), he obtained in Theorem (4.6) the convergence
of the number of haplotypes carried by more than $ce^{(\alpha-\theta)t}$ individuals to a mixed Poisson r.v. with
parameter $C W_\infty / (\alpha c^{\alpha/(\alpha-\theta)})$, where the constant $C$ is explicit and $W_\infty$ is the limit of $N_t e^{-\alpha t}$ when
$t \to +\infty$, where $N_t$ is the population size at time $t$. This result is consistent with ours, although the
precise value of the constant $C$ is not the same.

Although not stated in [18], one can easily extend this result using [14, Thm. A.1] to obtain the
convergence in distribution on the event of non-extinction of the point measure $\eta_t(\cdot)$, where $\eta_t([a_0, a_1])$
is the number of haplotypes carried at time $t$ by a number of individuals in $[a_0 e^{(\alpha-\theta)t}, a_1 e^{(\alpha-\theta)t}]$,
towards a mixed Poisson point process (also known as Cox process) on $\mathbb{R}_+$ with intensity measure

» {dx) = (a - ^ 

This is actually the kind of result that we are able to prove when mutations occur during the life
of individuals and when clonal families are subcritical or critical (see below). Unfortunately, the
method we develop in Section 4 does not apply to the supercritical case.

2.3.2 Subcritical clonal families 

When $\alpha < \theta$, we prove in Proposition 3.4 that for all $a \in \mathbb{R}$,

$$\lim_{t\to+\infty} \mathbb{E}_t O_t\Big(\frac{\alpha t}{\theta} + a\Big) = k e^{-\theta a}$$

for an explicit constant $k$, and that the maximal size of families older than $\alpha t/\theta + a$ is tight when
$t \to +\infty$ for all $a \in \mathbb{R}$. We also prove in Proposition 3.5 that for all $c \in \mathbb{R}$,

$$\mathbb{E}_t L_t(x_t(c)) \sim k\,\varphi(\theta)^{c + \{-x_t(c)\}}, \qquad (2.10)$$

where $\{x\}$ denotes the fractional part of the real number $x$, i.e. $\{x\} = x - \lfloor x \rfloor = x + \lceil -x \rceil$, where
$\lfloor\cdot\rfloor$ (resp. $\lceil\cdot\rceil$) is the integer part (resp. ceiling) function,

$$x_t(c) = k' t - k'' \log t + c$$

and

$$\varphi(\theta) := 1 - \frac{\psi(\theta)}{\theta}, \qquad (2.11)$$

for explicit constants $k, k', k''$. In addition, we prove that the age of these large families is of the
order of $\log t/(\theta - \alpha)$. Hence the largest and oldest families are different in the subcritical case.

Note that, both for large ages and large sizes, random fluctuations are of order 1 (the parameters
$a$ and $c$ are not multiplied by a function of $t$). This explains why the right-hand side of (2.10),
while remaining bounded, depends on $t$: on the one hand, the size of the largest families grows with
time as $x_t(0)$ and has fluctuations of order 1; on the other hand, the size of the largest families is an
integer and hence $L_t(x_t(c))$ only depends on $\lceil x_t(c) \rceil$. Therefore, as a function of $c$, the right-hand
side of (2.10) must only depend on $\lceil x_t(c) \rceil$, which is clear since $c + \{-x_t(c)\} = -x_t(0) + \lceil x_t(c) \rceil$.
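The identity behind this remark is elementary: since $x_t(c) = x_t(0) + c$ and $\{-x\} = -x + \lceil x \rceil$, one gets $c + \{-x_t(c)\} = \lceil x_t(c) \rceil - x_t(0)$. A minimal Python check, in which the function names and the sample values of $x_t(0)$ are purely illustrative:

```python
import math


def frac(x):
    """Fractional part {x} = x - floor(x), which lies in [0, 1)."""
    return x - math.floor(x)


def exponent_lhs(x0, c):
    """c + {-x_t(c)}, where x_t(c) = x_t(0) + c and x0 stands for x_t(0)."""
    return c + frac(-(x0 + c))


def exponent_rhs(x0, c):
    """ceil(x_t(c)) - x_t(0): depends on c only through the integer ceil(x_t(c))."""
    return math.ceil(x0 + c) - x0
```

In particular, two values of $c$ for which $\lceil x_t(c) \rceil$ coincide give the same exponent, for instance `c = 0.1` and `c = 0.7` when `x0 = 10.2`.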

This suggests that, given any sequence of times $(t_k)_{k \ge 0}$ such that $t_k \to +\infty$ and $\{x_{t_k}(0)\} =: v$ does
not depend on $k$, the r.v. $L_{t_k}(x_{t_k}(c))$ should converge in distribution, or that the r.v. $X^{(1)}_{t_k} - x_{t_k}(0)$,
where $X^{(1)}_t$ is the size of the largest family at time $t$, should converge in distribution to some r.v.
with values in $\mathbb{Z} - v = \{b - v, b \in \mathbb{Z}\}$. This is what we prove in Section 4. For example, we shall
state here Corollary 4.4.

Let us denote by $X_t^{(1)} \ge X_t^{(2)} \ge \ldots$ the ordered sequence of family sizes in the population at
time $t$. Let also $\mathcal{M}(\mathbb{R})$ be the set of nonnegative $\sigma$-finite measures on $\mathbb{R}$, finite on $\mathbb{R}_+$, and let us
define the semi-vague topology on $\mathcal{M}(\mathbb{R})$ as the one induced by all maps of the form

$$\nu \in \mathcal{M}(\mathbb{R}) \mapsto \int_{\mathbb{R}} u(x)\,\nu(dx),$$

for all bounded continuous functions $u$ such that, for some $x_0 \in \mathbb{R}$, $u(x) = 0$ for all $x \le x_0$. Then,
Corollary 4.4 states that the sequence of point processes $(Z_k)_{k \ge 0}$ on $\mathbb{Z} - v$ defined by

$$Z_k := \sum_{n \ge 1} \delta_{X^{(n)}_{t_k} - x_{t_k}(0)}$$

converges in $\mathbb{P}^\star$-distribution on $\mathcal{M}(\mathbb{R})$ equipped with the semi-vague topology to a mixed Poisson
point measure on $\mathbb{Z} - \{x_{t_0}(0)\}$ with intensity measure

$$\mathcal{E}\,k \sum_{c \in \mathbb{Z} - \{x_{t_0}(0)\}} \varphi(\theta)^{c}\,\delta_c,$$

where $k$ is an explicit constant and the mixture coefficient $\mathcal{E}$ has exponential distribution with
parameter 1.

We obtain similar results for the oldest families (Theorem 4.5): if $A_t^{(1)} \ge A_t^{(2)} \ge \ldots$ denotes the
ordered sequence of family ages in the population at time $t$, the family of point processes $(Z_t, t \ge 0)$
on $\mathbb{R}$ defined by

$$Z_t := \sum_{n \ge 1} \delta_{A_t^{(n)} - \alpha t/\theta}$$

converges in $\mathbb{P}^\star$-distribution on $\mathcal{M}(\mathbb{R})$ equipped with the semi-vague topology to a mixed Poisson
measure on $\mathbb{R}$ with intensity measure

$$\mathcal{E}\,k\,e^{-\theta a}\,da, \qquad (2.12)$$

where $k$ is an explicit constant.

In the case of CMJ processes with mutations occurring at birth, Jagers and Nerman [10, Applic. C]
and Taïb [18, Prop. (4.2)] obtained similar results for the extremes of the empirical age distribution.
Proposition (4.2) of [18] states the convergence of the point process $Z_t$ on the event of non-extinction
to a mixed Poisson measure on $\mathbb{R}$ with intensity measure

$$W_\infty\,k'\,e^{-\theta a}\,da. \qquad (2.13)$$

As already noted, in the case of splitting trees, the distribution of $W_\infty$ conditional on survival is
exponential [13], so that (2.12) has the same form as (2.13), although the constant $k'$ is different
from $k$.
The technique used in [18] makes use of so-called individual time-dependent random characteristics:
for any individual $i$ in the population, we define a nonnegative random process, called random
characteristic, $(\chi_i^t(u), u \in \mathbb{R})$, assigning some score to the individual at age $u$, such that $\chi_i^t(u) = 0$
for all $u < 0$. The random characteristic depends on an extra parameter $t$. We can then define the
branching process counted with the time-dependent random characteristic $\chi^t$ as

$$Z_t^{\chi^t} := \sum_i \chi_i^t(t - \sigma_i),$$

where the sum covers the set of all individuals which lived at some time in the population, and $\sigma_i$ is
the birth date of individual $i$. In such situations, the results of [10] allow one to prove the convergence
in distribution of $Z_t^{\chi^t}$ as $t \to +\infty$, under a set of assumptions, among which the most stringent is
that the random characteristic is individual, i.e. that for all $t \ge 0$, the random processes $(\xi_i, \chi_i^t)$ for
$i$ running in the set of all individuals in the population are i.i.d., where $(\xi_i(u), u \ge 0)$ is the process
counting the number of children of individual $i$ before age $u$.

In [18], this method is applied to the population of haplotypes (i.e. individuals above have to be 
understood as haplotypes), defining for any haplotype i (a variant of) the time-dependent random 
characteristic 



where A, is the life length of haplotype i. 

This method cannot be used when mutations occur during the life of individuals, since the age
of an individual when a mutation occurs influences the distribution of the progeny of the new
haplotype (except when lifetime durations are exponential r.v.), which contradicts the assumption
that the $(\xi_i, \chi_i^t)$ are i.i.d. One may think of defining another random characteristic based on individuals
rather than haplotypes, counting the number of mutations experienced by each individual which
occurred more than $\alpha t/\theta + a$ time units ago and which have descendants living at the present time. With this
choice, the random characteristic does not depend on the age of the individual's mother. However,
it depends on the whole progeny of the individual, so that $(\xi_i, \chi_i^t)$ and $(\xi_{i'}, \chi_{i'}^t)$ are independent only
if individual $i$ does not descend from $i'$ and conversely. Therefore, the method of [18] cannot be
applied to our case.

Proposition (4.2) of [18] makes use of precise estimates on the tail distribution of the extinction 
time of a clonal family, which are well-known in this context. No results are given in [18] on the 
size of large families in the subcritical case, presumably because their method would require precise 
estimates on the tail distribution of the size of a clonal population at any time, which are to our 
knowledge not known in general for CMJ processes. In our case of splitting trees with mutations
occurring during the life of individuals, our formulae (2.4) and (2.5) for the expected frequency
spectrum are exact. This allows us to obtain precise estimates for the tail distribution of the size of
a clonal population at some time, conditionally on the survival of the (clonal or not clonal) progeny
of this population at time $t$. We are then able to deduce exact asymptotics for the tail distribution
of the size of the largest family using a different method than in [18] (see the proof of Theorem 4.2).

We chose here to present results on largest and oldest haplotypes, but our method easily applies
to intermediate regimes. For example, one can easily adapt our calculations to prove that, for all
$\gamma \in [0, 1]$ and $c \in \mathbb{R}$,

E t u t (x t (c), sg. t ) -k^e) *-**®), 



where 

. , ail — j)t 
-log(p(9) 

Similarly, one can prove the convergence of the point process of sizes of families older than $(\alpha\gamma/\theta)t$
as $t \to +\infty$ and compute the limit as a mixed Poisson point measure.



2.3.3 Critical clonal families 

When $\alpha = \theta$, we prove in Proposition 3.6 that, for all $a \in \mathbb{R}$,

$$\lim_{t\to+\infty} \mathbb{E}_t O_t\Big(t - \frac{\log t}{\alpha} + a\Big) = k e^{-\alpha a}$$

for an explicit constant $k$, and that the maximal size of families older than $t - \log t/\alpha + a$ is tight as
$t \to +\infty$ for all $a \in \mathbb{R}$. As in the subcritical case, fluctuations are of order 1 here. We also prove in
Proposition 3.7 that, for all $c \in \mathbb{R}$,

lim K t L t (xt(c)) = k exp ( — — - - c 

where 

$$x_t(c) = k' t^2 + k'' t \log t + ct \qquad (2.14)$$

and the constants $k, k', k''$ are explicit. In addition, we prove that the age of these large families is of
the order of $t/2$. Here, since the fluctuations are of the order of $t$, the limit does not involve $\{-x_t(c)\}$
as in (2.10).

We are then able to deduce from these estimates the convergence of correctly rescaled point
measures of the size of the largest and the age of the oldest families (Theorems 5.1 and 5.3): using
the same notation as in Section 2.3.2, the family of point measures $(Z_t, t \ge 0)$ defined by

$$Z_t := \sum_{n \ge 1} \delta_{(X_t^{(n)} - k' t^2 - k'' t \log t)/t}$$

converges in $\mathbb{P}^\star$-distribution on $\mathcal{M}(\mathbb{R})$ equipped with the semi-vague topology to a mixed Poisson
measure on $\mathbb{R}$ with intensity measure

where the constant k is explicit. 

Note that this statement and the definition (2.14) are actually slightly different from those of
Sections 3.4 and 5.1. However, the results stated here can be proved by slightly modifying the proofs
of Proposition 3.7 and Theorem 5.1. We have chosen to leave this to the interested reader.
Similarly, writing $A_t^{(n)}$ for the $n$-th largest family age at time $t$, we obtain the convergence of the
family of point measures $(Z_t, t \ge 0)$ defined by

$$Z_t := \sum_{n \ge 1} \delta_{A_t^{(n)} - t + \log t/\alpha}$$

to a mixed Poisson point measure on $\mathbb{R}$ with intensity measure

$$\mathcal{E}\,e^{-\alpha a}\,da. \qquad (2.15)$$

Again, in [10] and [18], a similar result is stated only for extreme ages. It takes the same form
as our result with $\mathcal{E}$ replaced by $W_\infty$ and with a different multiplicative constant in the intensity
measure (2.15).

3 Large time asymptotics for the expected number of frequent or 
old haplotypes 

Our goal here is to prove the convergence results on the expectation of $L_t(x_t)$ and $O_t(s_t)$ stated above,
when clonal families are supercritical, subcritical or critical. We start with preliminary estimates on
the scale functions $W$ and $W_\theta$.

3.1 Preliminary results on scale functions 

Lemma 3.1 The survival probability of the splitting tree is $\alpha/b$, and the scale function $W$ has the
following asymptotic behavior:

$$W(t)\,e^{-\alpha t} = \frac{1 + O(e^{-\gamma t})}{\psi'(\alpha)}$$

as $t \to +\infty$, for some constant $\gamma > 0$.

Proof. The expression of the survival probability and the fact that $W(t) \sim e^{\alpha t}/\psi'(\alpha)$ were already
proved in [13]. In order to get the higher-order term, we use the fact that

$$\mathbb{P}(J > t) = \frac{\alpha}{b} + O(e^{-\gamma t}) \qquad (3.1)$$

as $t \to +\infty$, where $J$ is the extinction time of the splitting tree started from one individual with
random lifespan, distributed as $V$. Indeed, we know from [13] that the law $\mathbb{P}^\natural := \mathbb{P}(\cdot \mid J < \infty)$
is that of a subcritical splitting tree with lifespan measure $e^{-\alpha r}\Lambda(dr)$. In particular, under $\mathbb{P}^\natural$, the
lifetime $V$ of a single individual has exponential moments, and the first hitting time $T_0$ of 0 by the
contour process of the splitting tree also has exponential moments (because its Laplace exponent is
the inverse of $\psi^\natural := \psi(\cdot + \alpha)$). Since $J \le T_0$ a.s., $J$ has exponential moments, that is, there is some
$\gamma > 0$ such that $\mathbb{E}^\natural(e^{\gamma J}) < \infty$. As a consequence, also since $\alpha/b = \mathbb{P}(J = \infty)$,

$$\mathbb{P}(J > t) - \frac{\alpha}{b} = \Big(1 - \frac{\alpha}{b}\Big)\,\mathbb{P}^\natural(J > t) = O(e^{-\gamma t}).$$
Therefore, it follows from (2.3) that

$$W(t)\,e^{-\alpha t} = \frac{1}{\psi'(\alpha)} \exp\Big(-\int_t^\infty \big(b\,\mathbb{P}(J > x) - \alpha\big)\,dx\Big),$$

and the result then follows from (3.1). □



From this result and the definition (2.6) of $W_\theta$, we can deduce the following lemma. We recall
that

$$\psi_\theta(x) = \frac{x\,\psi(x+\theta)}{x+\theta} \quad\text{and}\quad \psi_\theta'(\alpha-\theta) = \frac{(\alpha-\theta)\,\psi'(\alpha)}{\alpha}.$$



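Lemma 3.2 (i) Assume $\alpha > \theta > 0$. Then

$$W_\theta(t) \sim \frac{e^{(\alpha-\theta)t}}{\psi_\theta'(\alpha-\theta)}$$

as $t \to +\infty$.
(ii) Assume $0 < \alpha < \theta$. Then

$$\frac{\theta}{\psi(\theta)} - W_\theta(t) \sim \frac{e^{-(\theta-\alpha)t}}{|\psi_\theta'(\alpha-\theta)|} \qquad (3.2)$$

as $t \to \infty$, and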

1 -w^r^ )<WW) ' 

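where $\varphi(\theta)$ is defined in (2.11) and $\rho(\cdot)$ is a non-increasing function such that

as $t \to \infty$.
(iii) Assume $\alpha = \theta > 0$. Then

$$W_\alpha(t) = \frac{\alpha t}{\psi'(\alpha)} + \frac{1}{\psi'(\alpha)} + \alpha \int_0^{+\infty} \Big(W(y)\,e^{-\alpha y} - \frac{1}{\psi'(\alpha)}\Big)\,dy + o(1)$$

as $t \to +\infty$.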



Proof. Points (i) and (iii) are easy consequences of Lemma 3.1 and the definition (2.6) of $W_\theta$.
Point (i) can also be seen as a trivial corollary of Lemma 3.1 since, when $\alpha > \theta$, $W_\theta$ is the scale
function of a supercritical splitting tree.

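For Point (ii), by Tauberian theorems (see p3]), we have $W_\theta(t) \to 1/\psi_\theta'(0) = \theta/\psi(\theta)$ as $t \to \infty$.
Since
\[ \int_0^{\infty} W(y)e^{-\theta y}\,dy = \frac{1}{\psi(\theta)}, \]
(2.6) yields
\[ W_\theta(t) - \frac{\theta}{\psi(\theta)} = e^{-\theta t}W(t) - \theta\int_t^{\infty} W(y)e^{-\theta y}\,dy. \]
Since $W(t) \sim e^{\alpha t}/\psi'(\alpha)$, we have
\[ \int_t^{\infty} W(y)e^{-\theta y}\,dy \sim \frac{1}{\psi'(\alpha)}\int_t^{\infty} e^{-(\theta-\alpha)y}\,dy \]
as $t \to +\infty$. This entails (3.2). Since one has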

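\[ \rho(t) = \frac{1}{\varphi(\theta)}\left(\frac{1}{W_\theta(t)} - \frac{\psi(\theta)}{\theta}\right) \]
and
\[ 1 - \varphi(\theta) = \frac{\psi(\theta)}{\theta}, \]
the proof of (ii) is easily completed. □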
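The asymptotics of Lemma 3.2 (i) and (ii) can be checked numerically in the exponential-lifetime case. This is only a sketch under stated assumptions: lifetimes are exponential with rate `d` (hypothetical), so that $W(t) = (be^{\alpha t} - d)/\alpha$, $\psi(x) = x(x-\alpha)/(x+d)$ and $\psi'(\alpha) = \alpha/b$, and $W_\theta$ is taken to be $W_\theta(t) = e^{-\theta t}W(t) + \theta\int_0^t W(y)e^{-\theta y}dy$, which is the form used in the proof above.

```python
import math

b, d = 2.0, 0.5          # hypothetical birth rate and lifetime rate
alpha = b - d            # Malthusian parameter; psi'(alpha) = alpha / b in this case

def W(t):
    """Scale function in the exponential-lifetime case."""
    return (b * math.exp(alpha * t) - d) / alpha

def W_theta(t, theta, n=20_000):
    """e^{-theta t} W(t) + theta * int_0^t W(y) e^{-theta y} dy, via the trapezoid rule."""
    h = t / n
    f = lambda y: W(y) * math.exp(-theta * y)
    integral = h * (0.5 * f(0.0) + sum(f(k * h) for k in range(1, n)) + 0.5 * f(t))
    return math.exp(-theta * t) * W(t) + theta * integral

def dpsi_theta(theta):
    """psi_theta'(alpha - theta) = (alpha - theta) * psi'(alpha) / alpha."""
    return (alpha - theta) * (alpha / b) / alpha

# (i) alpha > theta: W_theta(t) * psi_theta'(alpha - theta) * e^{-(alpha - theta) t} -> 1
theta = 1.0
ratio_i = W_theta(20.0, theta) * dpsi_theta(theta) * math.exp(-(alpha - theta) * 20.0)

# (ii) alpha < theta: (theta/psi(theta) - W_theta(t)) * |psi_theta'(alpha - theta)| * e^{(theta - alpha) t} -> 1
theta = 3.0
limit = (theta + d) / (theta - alpha)   # theta / psi(theta) for psi(x) = x (x - alpha) / (x + d)
ratio_ii = (limit - W_theta(4.0, theta)) * abs(dpsi_theta(theta)) * math.exp((theta - alpha) * 4.0)
print(ratio_i, ratio_ii)
```

Both ratios should be close to 1; in this special case $W_\theta$ is itself the scale function of a birth-death process with death rate $d + \theta$.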



3.2 The case of supercritical families ($\alpha > \theta$)

In the case of supercritical clonal families, the asymptotic expected number of frequent haplotypes 
can be explicitly computed. Note that in the next statement and elsewhere in the paper, the Dirac 
measure at time corresponds to the contribution of the family carrying the ancestral type. 

Proposition 3.3 Assume $\alpha > \theta > 0$ and let $0 \le a_0 < a_1 \le +\infty$ and $c > 0$. For all $t \ge 0$, let
\[ x_t(c) = c\exp((\alpha-\theta)t). \]
Then

lim E t [M t (x t (c),t-a 1 ,t-a )] = ?—?- [ * exp (ay - c^' e (a - 9) e ( a "%) (Q dy + 5 Q {dy)). 
t ^+°° a J an \ ) 



Proof. Using (|2.5|) . (|2.8|) and Lemma EH for all t > a, we have 



K t [M t {x t ,t-b,t-a)}= J2 



e l 1 - 



i-a -0a: / 1 \ 



W/f) / 1 \ \ce(™- e »]-l r t-a -9x / 1 \ rce (Q - 9)t l-l 



6At fly 



exp 



(fce(«-»)*] - 1) i og (l - 1/W e (t - y))j (6dy + 5 (dy)), 

(3.4) 
where we recall that $\lceil\cdot\rceil$ is the ceiling function. Since $W_\theta$ is nondecreasing and $W_\theta(0) = 1$, it follows
from Lemma 3.2 (i) that there exists a constant $C > 0$ such that
\[ \frac{1}{C}\,e^{(\alpha-\theta)t} \le W_\theta(t) \le C\,e^{(\alpha-\theta)t}, \qquad \forall t \ge 0. \]
Therefore, for all $y \ge 0$, the quantity inside the integral in the r.h.s. of (3.4) is smaller than
\[ C' e^{\alpha y}\exp\left(-C'' e^{(\alpha-\theta)y}\right) \]
for some constants $C', C'' > 0$. Now, using Lemma 3.2 (i) again, for all $y \ge 0$, the quantity inside
the integral in the r.h.s. of (3.4) converges to



i>'(a)(a-9) , _ {a _ e)y 



-e ay exp ( -cip'Ja - 0)e { 
a \ 

when $t \to +\infty$. Proposition 3.3 then follows from the dominated convergence theorem. □



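3.3 The case of subcritical families ($\alpha < \theta$)

Our first result deals with ages and sizes of the oldest clonal families. Note that the scaling constant
$a$ varies in $\mathbb{R}$.

Proposition 3.4 When $\alpha < \theta$, for any $a \in \mathbb{R}$,
\[ \lim_{t\to+\infty} \mathbb{E}_t\left[O_t\left(\frac{\alpha t}{\theta} + a\right)\right] = \frac{\psi(\theta)}{\theta\,\psi'(\alpha)}\,e^{-\theta a}. \tag{3.5} \]
In addition, for any $x_t \to +\infty$ as $t \to +\infty$,
\[ \lim_{t\to+\infty} \mathbb{E}_t\left[M_t\left(x_t, \frac{\alpha t}{\theta} + a\right)\right] = 0. \tag{3.6} \]

Proof. Using (2.5) and (2.8) as in the proof of Proposition 3.3, we have
\[ \mathbb{E}_t[O_t(\alpha t/\theta + a)] = W(t)\int_{\alpha t/\theta + a}^{t} \frac{e^{-\theta x}}{W_\theta(x)}\,(\theta\,dx + \delta_t(dx)) \sim \frac{\psi(\theta)\,e^{\alpha t}}{\theta\,\psi'(\alpha)}\int_{\alpha t/\theta + a}^{t} e^{-\theta x}\,(\theta\,dx + \delta_t(dx)), \]
where we used that $W_\theta(x) \to \theta/\psi(\theta)$ as $x \to +\infty$ (Lemma 3.2). Eq. (3.5) then easily follows.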
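Similarly,
\[ \mathbb{E}_t[M_t(x_t, \alpha t/\theta + a)] = W(t)\int_{\alpha t/\theta + a}^{t} \frac{e^{-\theta x}}{W_\theta(x)}\left(1 - \frac{1}{W_\theta(x)}\right)^{\lceil x_t\rceil - 1}(\theta\,dx + \delta_t(dx)) \]
\[ \le \left(1 - \frac{1}{W_\theta(t)}\right)^{\lceil x_t\rceil - 1} W(t)\int_{\alpha t/\theta + a}^{t} \frac{e^{-\theta x}}{W_\theta(x)}\,(\theta\,dx + \delta_t(dx)), \]
since $W_\theta$ is nondecreasing. Eq. (3.6) then follows from (3.5) and the fact that $(1 - 1/W_\theta(t))^{\lceil x_t\rceil - 1} \to 0$
as $t \to +\infty$, since $W_\theta(t) \to \theta/\psi(\theta) > 1$. □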
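The limit (3.5) can be illustrated numerically. The sketch below is not part of the paper's argument: it assumes exponential lifetimes with rate `d` (hypothetical), for which $W(t) = (be^{\alpha t} - d)/\alpha$, $\psi'(\alpha) = \alpha/b$, $\psi(\theta) = \theta(\theta-\alpha)/(\theta+d)$ and $W_\theta(x) = (be^{(\alpha-\theta)x} - (\theta+d))/(\alpha-\theta)$, and it takes the expected number of families older than $s$ to be $W(t)\int_s^t e^{-\theta x}W_\theta(x)^{-1}(\theta\,dx + \delta_t(dx))$, as in the proof above.

```python
import math

b, d, theta = 2.0, 0.5, 3.0     # hypothetical rates, chosen so that alpha = b - d < theta
alpha = b - d

def W(t):
    return (b * math.exp(alpha * t) - d) / alpha

def W_theta(x):
    # closed form of e^{-theta x} W(x) + theta * int_0^x W(y) e^{-theta y} dy in this special case
    return (b * math.exp((alpha - theta) * x) - (theta + d)) / (alpha - theta)

def expected_old_families(t, a, n=20_000):
    """W(t) * int_s^t e^{-theta x} / W_theta(x) (theta dx + delta_t(dx)), with s = alpha t / theta + a."""
    s = alpha * t / theta + a
    h = (t - s) / n
    f = lambda x: theta * math.exp(-theta * x) / W_theta(x)
    integral = h * (0.5 * f(s) + sum(f(s + k * h) for k in range(1, n)) + 0.5 * f(t))
    dirac = math.exp(-theta * t) / W_theta(t)     # contribution of the ancestral family
    return W(t) * (integral + dirac)

def limit(a):
    psi_theta = theta * (theta - alpha) / (theta + d)   # psi(theta)
    dpsi_alpha = alpha / b                              # psi'(alpha)
    return psi_theta * math.exp(-theta * a) / (theta * dpsi_alpha)

for a in (-1.0, 0.0, 1.0):
    print(a, expected_old_families(20.0, a), limit(a))
```

For $t = 20$ the two printed columns should already agree to several digits, which reflects the exponential rate of convergence of $W_\theta$ in Lemma 3.2 (ii).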

Our next result gives the asymptotics of the expected number of large families (notice again that
the scaling constant $c$ varies in $\mathbb{R}$) and states that their ages are all asymptotically equivalent to
$(\theta-\alpha)^{-1}\log(t)$. We recall that $\varphi(\theta) < 1$, so that $|\log\varphi(\theta)| = -\log\varphi(\theta)$.

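Proposition 3.5 Assume $\alpha < \theta$. For all $c \in \mathbb{R}$, let
\[ x_t(c) := \frac{\alpha t - \frac{\theta}{\theta-\alpha}\log t}{|\log\varphi(\theta)|} + c. \]
Then, for all $\varepsilon > 0$,
\[ \mathbb{E}_t[L_t(x_t(c))] \sim \mathbb{E}_t\left[M_t\left(x_t(c), \frac{1-\varepsilon}{\theta-\alpha}\log t, \frac{1+\varepsilon}{\theta-\alpha}\log t\right)\right] \sim A(\theta)\,\varphi(\theta)^{c-1+\{-x_t(c)\}} \tag{3.7} \]
as $t \to +\infty$, where we recall that $\{\cdot\}$ is the fractional part function and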

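\[ A(\theta) := \frac{\Gamma\left(\frac{\theta}{\theta-\alpha}\right)\psi(\theta)}{(\theta-\alpha)\,\psi'(\alpha)}\left(\frac{\theta^2(\theta-\alpha)\,\varphi(\theta)\,\psi'(\alpha)\,|\log\varphi(\theta)|}{\alpha^2\,\psi(\theta)^2}\right)^{\frac{\theta}{\theta-\alpha}}, \tag{3.8} \]
where $\Gamma(s) = \int_0^{\infty} y^{s-1}e^{-y}\,dy$ is the Gamma function.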

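Proof. Proceeding similarly as in the proof of Proposition 3.3, we deduce from Lemma 3.2 (ii) that
\[ \mathbb{E}_t[L_t(x_t(c))] = W(t)\,\varphi(\theta)^{\lceil x_t(c)\rceil - 1}\int_0^t \frac{e^{-\theta x}}{W_\theta(x)}\,(1-\rho(x))^{\lceil x_t(c)\rceil - 1}\,(\theta\,dx + \delta_t(dx)) \]
\[ \sim \frac{t^{\theta/(\theta-\alpha)}\,\varphi(\theta)^{c-1+\{-x_t(c)\}}}{\psi'(\alpha)}\int_0^t \frac{e^{-\theta x + (\lceil x_t(c)\rceil - 1)\log(1-\rho(x))}}{W_\theta(x)}\,(\theta\,dx + \delta_t(dx)) \tag{3.9} \]
as $t \to +\infty$, where we used the relation $\lceil x_t(c)\rceil - 1 = x_t(0) + c - 1 + \{-x_t(c)\}$. We will now abridge
$x_t(c)$ into $x_t$.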

Let e > 0. Let us first bound from above the previous integral restricted to the complement 



of 
We 



^ log ^ log* 



On the one hand, using the inequality log(l — p{x)) < and the fact that 



x) converges to a positive constant when x — > +00, we have for all e > 

+00 ex+(\x^-i)\og{i- P (x)) , j, . 

— — (9dx + 5 t (dx)) <c(e- et + e- e ^ lost ) = o{t 

tewi Wq{x) V J 



: log t 

as i — > +00. 

On the other hand, since p{x) is non-increasing and using (|3.3|) . p{x) > p(-^^logt) > Ct~ 1+e 
for all < x < (1 — e)/{9 — a) logt. Therefore, 

/•yE|logt e -flar+(rxtl-l)log(l-p(x)) _ /-^Eflogt 



W e (x) 



{9dx + 5 t {dx)) < C ~ e~ 9x e -C(r*tl-i)i t- 1+e dx. 
Since $(\lceil x_t\rceil - 1)\,t^{-1+\varepsilon} \ge C'' t^{\varepsilon}$ for $t$ large enough, we deduce that the previous integral is also $o(t^{-\theta/(\theta-\alpha)})$.

are both asymptotically equiv- 



In conclusion, E t [L t (x t (a))] and E t M t \x t (a), j^logt, £±f logi 
alent to Et(e), provided that Et{e) is uniformly bounded from below, where 

E((£) = m tWe -y^ H -M^ jm e _ to+(M _ 1)M1 _ pMW (3io) 

Note that, since Wq(x) — > ip{9)/6 when x — > +00, the replacement of Wg(x) by its limit in the r.h.s. 
of (13. 9p is justified. The proof of Proposition 13.51 will hence be completed if we can prove that 

E t (s) ~ A(9) <p{9) c - 1+{ - Xt( - c)} . 

This is the aim of the rest of the proof. Set 

B: - ow6)m<x-e)v (3 ' n) 

and we make the change of variable y := £?(|~a;t] — X)e~^~ a ' x in (13. 10H : 



EAe) ~ / v 

] 4>>(a)(e-a)(B(\x t ] -!))«/(*-*) J BiM . iyt 



l-e 

1M__1-1). 



expf -l)log(l -pi — ± J J Jdy. (3.12 



Now, using Lemma 13.21 (ii) again, we have for all y > 

pl-1 v ) ^Be los ^k^) 



a J \x t ] - 1 

as t — > +00. Therefore, the exponential in the r.h.s. of (|3.12p converges for all y > to e~ y when 
t — > +00. In addition, using the inequalities log(l — p(x)) < —p(x) and p(x) > Ce~^ e ~ a ^ x for all x 
large enough, we have 



/ / /l gMl£lbl)\ \ \ 
^expf (\x t ] -l)log(l-p( — ^ jjj <y^e 



-Cy 



Since x{t 1 6 — > and a^t 1+e — > +00 when t — > +00, Lebesgue's theorem finally yields 
Remembering that xt ~ ai/| log (p{9)\ as t — > +00 concludes the proof of Proposition 13.51 □ 
3.4 The case of critical families ($\alpha = \theta$)

The following result gives the asymptotic expected number of old families and states that their sizes
are tight.

Proposition 3.6 Assume $\alpha = \theta > 0$. For all $a \in \mathbb{R}$, we have
\[ \lim_{t\to+\infty} \mathbb{E}_t\left[O_t\left(t - \frac{\log t}{\alpha} + a\right)\right] = \frac{e^{-\alpha a}}{\alpha}. \]
In addition, for all $x_t \to +\infty$,
\[ \lim_{t\to+\infty} \mathbb{E}_t\left[M_t\left(x_t, t - \frac{\log t}{\alpha} + a\right)\right] = 0. \]



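Proof. Using Lemma 3.2 (iii), a similar computation as for Proposition 3.4 yields
\[ \mathbb{E}_t[O_t(t - \log t/\alpha + a)] \sim e^{\alpha t}\int_{t - \frac{\log t}{\alpha} + a}^{t} \frac{e^{-\alpha x}}{x}\left(dx + \frac{1}{\alpha}\,\delta_t(dx)\right) \sim \frac{e^{\alpha t}}{t}\int_{t - \frac{\log t}{\alpha} + a}^{t} e^{-\alpha x}\,dx + \frac{1}{\alpha t}. \]
The first limit easily follows. The second limit is obtained exactly as in the proof of Proposition 3.4.
□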

The computations for the most frequent haplotypes are more involved. The following result 
gives the asymptotics of the expected number of large families and states that their ages are all 
asymptotically equivalent to t/2. 

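Proposition 3.7 Assume $\alpha = \theta > 0$. For all $c \in \mathbb{R}$, we define
\[ x_t(c) := \frac{\alpha^2}{4\psi'(\alpha)}\left(t - \frac{\log t}{2\alpha} + c\right)^2. \]
Then, for all $\varepsilon > 0$,
\[ \lim_{t\to+\infty} \mathbb{E}_t[L_t(x_t(c))] = \lim_{t\to+\infty} \mathbb{E}_t\left[M_t\left(x_t(c), \frac{1-\varepsilon}{2}\,t, \frac{1+\varepsilon}{2}\,t\right)\right] = \sqrt{\frac{2\pi}{\alpha}}\;e^{B - \frac{\psi'(\alpha)}{2}}\,e^{-\alpha c}, \]
where
\[ B = 1 + \alpha\int_0^{+\infty}\left(\psi'(\alpha)\,W(y)\,e^{-\alpha y} - 1\right)dy. \tag{3.13} \]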



Proof. Similarly as in the proof of Proposition 13.51 we have 



E t [L t {x t {c))\ 



Jo W a (x) 



z ilxtic)] ' 1)los ^^)(adx + S t (dx)). 



(3.14) 



18 



Let e > 0. Let us first bound from above the previous integral restricted to the complement of 
[^f^nr*]- Fix r] G (0,1). By Lemma O (iii) , for all x large enough, 1 < W a (x) < ^ (ct " ( t 1 _ ?7 ) ■ 
Hence, using the fact that xt(c) ~ a 2 t 2 /Aip'(a), for t sufficiently large, 



t Q~ OX ' r ' ^ - > ' f - 1 ^ Ft 



i±i* W Q (x) 7l±£ t 



2 



<a / e V *" ; ( ix + e- at ( 1+ ( 1 - 2f ')/ 4 ) 



-t 



The quantity inside the integral in the integral of the r.h.s. is maximal for x = (1 — 2r/) 1 / 2 t/2, which 
is outside the integration domain, so that 



~ P V V 2 2(1 + e) 



l+£ f 
2 1 



This last quantity is o(e a ') if one chooses 77 < e 2 /2. 

Using the fact that W a (x) is non-decreasing, larger than 1 and that e~ ax < 1, we have 

l0Si ^1(^(0)1-1)^(1-^) -XHiMtl 



adx < ae Wa(i°g*) lost. 

W a [x) 

Since xt(c) ~ a 2 i 2 /4?//(a), using Lemma f3T2l (iii) again, for t large enough, the previous integral is 
smaller than 

Qlog(t)e- Ci2 / log * = o{e- Qt ), 

for a constant C > independent of t. 

Finally, using Lemma 13.21 (iii) similarly as above, for all r\ 6 (0, 1), for t large enough, 



; (M C )l-i)io g (i-w^) a£fa < a /^Y^+^L 



Taking 77 small enough so that (1 — ?y) 1 / 2 t/2 > t(l — e) /2, the previous quantity can be bounded from 
above by 

ate -f(l-e + ^) =o{e -« t) 

if one takes again rj < e 2 /2. In conclusion, E t [L t (x t (c))] and ~E t [M t (x t (c), ^j^-t, -^t)] are both 
asymptotically equivalent to E t (e), provided that E t (e) is uniformly bounded from below, where 

Et (e) ■= ^- ^1,(^(01-1)^(1-^) 

Therefore, it only remains to prove that 

lim E t (e) = \ — e B ~ e~ ac . 
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Using the facts that W a (x) ~ ax/ip'(a) and that | log(l — l/W Q (:r))| < C/t for all x large enough, 
we have 



1+6 



when t — > +00. 

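It follows from Lemma 3.2 (iii) that
\[ \log\left(1 - \frac{1}{W_\alpha(x)}\right) = -\frac{\psi'(\alpha)}{\alpha x} + \frac{\psi'(\alpha)\,(B - \psi'(\alpha)/2)}{\alpha^2 x^2} + o\left(\frac{1}{x^2}\right) \tag{3.15} \]
as $x \to +\infty$. Therefore,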

1 \ rp'(a) ij/(a)(B-i//(a)/2) 



F(t) := x t (c) sup 



log 1 



W a (x) J ax 



as t — >■ +00. Hence, using the facts that xt(c) ~ . and that x S [^o^i, ^o^iji for t large enough, 



2 e e « cte < £? t (e) 



(l + e)t 



5~* 



2e C£ f'f* „, t ( e )V'(q) 

< 7- — e 2 e at e ax dx, (3.16) 



(l-e)* 



2 r 



for a constant C independent of e and t. 

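Now, let us compute the asymptotic behavior of the integral involved in these inequalities: first,
the change of variable $x = \beta_t y$ with $\beta_t = \sqrt{x_t(c)\,\psi'(\alpha)}/\alpha$ yields
\[ \int_{\frac{(1-\varepsilon)t}{2}}^{\frac{(1+\varepsilon)t}{2}} e^{-\alpha x - \frac{x_t(c)\psi'(\alpha)}{\alpha x}}\,dx = \beta_t\int_{\frac{(1-\varepsilon)t}{2\beta_t}}^{\frac{(1+\varepsilon)t}{2\beta_t}} e^{-\alpha\beta_t\left(y + \frac{1}{y}\right)}\,dy. \]
Next, we introduce the new change of variable
\[ y = \frac{2 + z^2 + z\sqrt{z^2+4}}{2}. \]
This defines a $C^1$-diffeomorphism from $z \in (-\infty, +\infty)$ to $y \in (0, +\infty)$ such that
\[ y + \frac{1}{y} = 2 + z^2. \]
Note that $z \ge 0$ if and only if $y \ge 1$, which means that
\[ z = \mathrm{sgn}(y-1)\sqrt{y + \frac{1}{y} - 2}, \]
where $\mathrm{sgn}(x) = 1$ if $x \ge 0$ and $-1$ if $x < 0$. Since
\[ \beta_t = \frac{t}{2} - \frac{\log t}{4\alpha} + \frac{c}{2}, \tag{3.17} \]
the inequality $(1-\varepsilon)t/2\beta_t < 1 < (1+\varepsilon)t/2\beta_t$ holds for $t$ large enough, which yields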



l+£i , „ / (l+£)t , 2ft ~ 



Now, 



lim , ILzi)l + 7 JA _ 2= « a „ d lim 



t^+oo y 2p t (l-e)t ' y/T^e ' ' t^+oo y 2ft (l + e)< VTTe 
Since z + is C 1 in the neighborhood of 0, with value 1 at z = 0, we obtain 



/ (l+g)t 2ft i+e 

(1 - Ce)P«r**> I v 2ft (1+£)t " e-^ 2 cfz< / 2 e — ^^^d* 



2ft n (l-e)t * 2 



-f 



2ft +TT=I)t ^ 



for £ large enough. Making the last change of variable u = \J2aj3t z finally yields 



(1 _ C 'e)\A-^ < e—^^dx < (1 + C'e)\[^e-^- 



a / V a 

' 2 c 



for i large enough. Combining this with (I3.16p . we obtain that, for all e > small enough, there 
exists to > such that, for all t > to, 



(1 - C"e) 2 -e B -^e«\r—z- 2aPt < ^e) < (1 + C'ef-e^e^J^e-^ 
t V a t V a 

where the constant C" is independent of e and t. It then follows from (|3.17p that 

2vr B _^M. _ c . „ , v W1 , w/ v /2tt h_*!M „- ac 



(1 - C"eW — — a e" QC < £U £ ) < (1 + C"e)\ — e B a e 

V a V a 

Since we have shown in the beginning of the proof that Et(e) = Et(e') + o(l) for all e' < e, the 
previous inequality applied to Et(e') concludes the proof of Proposition 13.71 □ 



4 Large or old families: convergence in distribution for subcritical clonal families

In all this section, we assume $\alpha < \theta$. Our goal is to compute the joint limiting distribution when
$t \to +\infty$ of the sizes of the largest families living at time $t$, and of the ages of the oldest families
living at time $t$. Before this, we give estimates on the second factorial moment of the number of
large families, used repeatedly in the sequel.

4.1 A preliminary lemma



Let us recall that, under P^, we can adopt the representation of the genealogy at time t by the 
coalescent point process Hq, Hi, H2, ■ ■ ., where Hq = +00 and the (Hf, i > 1) are i.i.d., killed at their 
first value (= Nt) larger than t. For all i > 0, we call branch i the lineage represented by H{. 

For all $t > 0$, $x \ge 1$, $0 \le s_1 < s_2 \le +\infty$, we define $K_t(x, s_1, s_2)$ as the number of haplotypes
carried by more than $x$ individuals alive at time $t$, whose original mutation occurred on the ancestral
lineage (branch 0), and has age in $(s_1, s_2]$ (or, equivalently, in $(s_1, s_2 \wedge t]$).

Lemma 4.1 For all $t > 0$, $x \ge 1$, $0 \le s_1 < s_2 \le t$, we have



E t [M t (x, Sl ,S2)(M t (x, Sl ,S2) - 1)] < 2 E t K t (x, Sl ,s 2 ) 



1 + 



1 1 

W( Sl ) W(t) 



W(t) 



EtN< 



b(l + 6(s2- Sl )) 



o 



Wg(s 2 ) 



fxl-1 



,-081 



% e -e z We(s2)-W e{si - z ) dz 



W e {s 2 ) 



+4 W \ Sl) ^ W Et ^ t (fx/21, S1 , S2 ) 



(4.1) 



and 



E t i^(x,si,s 2 ) < 



6 r 2 



a 



a 



1 



Wg(y) 



1 



fxl-1 



" ae -e z We{y) -W e {y - z) ^ + 







W e (y) 



fxl-1 



This lemma is proved in the Appendix. 



(9dy + 6 t (dy)). 



(4.2) 
(4.3) 



4.2 Convergence in distribution of the size of the most frequent haplotype

Let us recall the notation $X_t^{(1)} \ge X_t^{(2)} \ge \ldots \ge X_t^{(k)} \ge \ldots$ for the ordered sequence of sizes of all
living families in the population at time $t$ (with the convention that $X_t^{(k)} = 0$ when $k$ is larger than
the number of living haplotypes at time $t$). Our first goal is to prove the convergence in distribution
of $X_t^{(1)}$ using only the exact formulae (2.4) and (2.5) and the coalescent point process construction
of the genealogy of the splitting tree.

The general idea is the following one. We divide the population at time $t$ into several sub-populations
corresponding to distinct ancestors at a given time $s$, as shown in Fig. 2. This gives
a sequence of sub-trees $(\mathcal{T}_i)_{1\le i\le N_{t,s}}$, where $N_{t,s}$ is the number of individuals alive at time $s$ having
descendants at time $t$. These individuals are represented by crosses in Fig. 2. We choose $s = s_t$ in
such a way that $N_{t,s_t} \to \infty$ and the event under consideration (here, the event that there exists a
haplotype carried by more than $x_t$ individuals at time $t$) has a small probability in each sub-tree, and
these events are "nearly independent" (in a sense specified in the proofs below) in distinct sub-trees. The key
argument of the proof, for which Lemma 4.1 is needed, consists in checking that, in each subtree, the
unknown probability that there exists a haplotype satisfying the property under consideration (here,
carried by more than $x_t$ individuals) is close to the expected number of such haplotypes, which is
known explicitly. Here, this reads
\[ \mathbb{P}_t[L_{t-s_t}(x_t) \ge 1] \sim \mathbb{E}_t L_{t-s_t}(x_t). \]

Figure 2: The definition of the sub-trees $(\mathcal{T}_i)_{1\le i\le N_{t,s}}$. The first vertical line represents the ancestral
lineage (branch 0) and the other vertical lines have i.i.d. lengths $H_1, H_2, \ldots$ The rightmost vertical
line is the first one higher than $t$, with length $H_{N_t}$. The crosses represent the $N_{t,s_t}$ individuals alive
at time $s_t$ and having descendants at time $t$. Here, $N_{t,s_t} = 10$.

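Theorem 4.2 Assume $\alpha < \theta$. For all $c \in \mathbb{R}$, let
\[ x_t(c) := \frac{\alpha t - \frac{\theta}{\theta-\alpha}\log t}{|\log\varphi(\theta)|} + c. \tag{4.4} \]
Then
\[ \mathbb{P}_t\left(X_t^{(1)} < x_t(c)\right) \sim \frac{1}{1 + A(\theta)\,\varphi(\theta)^{c-1+\{-x_t(c)\}}} \]
as $t \to +\infty$, where $A(\theta)$ is defined in (3.8).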


Proof. Set
\[ F(t,x) := \mathbb{P}_t(X_t^{(1)} \ge x) = \mathbb{P}_t[L_t(x) \ge 1], \qquad G(t,x) := \mathbb{E}_t[L_t(x)] \]
and
\[ F(t,x,s) := \mathbb{P}_t[M_t(x,s) \ge 1]. \]
We have for all $s \le t$
\[ 0 \le F(t,x) - \mathbb{P}_t\big[\exists i \in \{1,\ldots,N_{t,s}\} : \mathcal{T}_i \text{ contains a haplotype carried by more than } x \text{ individuals}\big] \le F(t,x,t-s), \]
which also reads
\[ 0 \le F(t,x) - 1 + \mathbb{E}_t\big[(1 - F(t-s,x))^{N_{t,s}}\big] = F(t,x) - \frac{1}{1 + \mathbb{P}(H > t \mid H > t-s)\left(\frac{1}{F(t-s,x)} - 1\right)} \le F(t,x,t-s), \tag{4.5} \]
since $N_{t,s}$ is a geometric random variable of parameter $\mathbb{P}(H > t \mid H > t-s)$. In view of this, in
order to find a non-trivial limit for $F(t,x_t)$, we need to find $s_t$ and $x_t$ such that $F(t,x_t,t-s_t) = o(1)$
and $F(t-s_t,x_t) = o(1)$ and is asymptotically equivalent to
\[ \mathbb{P}(H > t \mid H > t-s_t) = W(t-s_t)/W(t) \sim e^{-\alpha s_t} \tag{4.6} \]
as $t \to +\infty$. In order to find an explicit asymptotic equivalent of $F(t-s_t,x_t)$, we will compare it
with $G(t-s_t,x_t)$.
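The equality in (4.5) is the probability generating function of a geometric variable: if $P(N = k) = p(1-p)^{k-1}$ for $k \ge 1$, then $1 - E[(1-F)^N] = 1/(1 + p(1/F - 1))$. The following sketch (with hypothetical helper names `one_minus_pgf` and `closed_form`) checks this numerically and illustrates why a limit of the form $\Gamma/(1+\Gamma)$ appears when $F \approx \Gamma p$ with $p \to 0$.

```python
def one_minus_pgf(F, p, kmax=20_000):
    """1 - E[(1 - F)^N] for P(N = k) = p (1 - p)^(k - 1), k >= 1 (truncated series)."""
    return 1.0 - sum(p * (1.0 - p) ** (k - 1) * (1.0 - F) ** k for k in range(1, kmax + 1))

def closed_form(F, p):
    """The expression 1 / (1 + p (1/F - 1)) appearing in (4.5)."""
    return 1.0 / (1.0 + p * (1.0 / F - 1.0))

for F, p in [(0.3, 0.5), (0.02, 0.01), (0.9, 0.1)]:
    print(F, p, one_minus_pgf(F, p), closed_form(F, p))

# If F = gamma * p with p -> 0, the closed form tends to gamma / (1 + gamma).
gamma, p = 2.0, 1e-8
print(closed_form(gamma * p, p), gamma / (1.0 + gamma))
```

Taking complements, the probability that no sub-tree contains a large family tends to $1/(1+\Gamma)$, which has the same shape as the limit in Theorem 4.2.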

Let us check that the choice $x_t(c)$ in (4.4) for $x_t$ and
\[ s_t = s_t(b) = t - b\log t \]
satisfy the above properties for all $b > 1/(\theta-\alpha)$.

On the one hand, the fact that $F(t,x_t(c),t-s_t(b)) = o(1)$ is an immediate consequence of
Proposition 3.5, since $b\log t > (1+\varepsilon)\log t/(\theta-\alpha)$ for some $\varepsilon > 0$. On the other hand, we can
compute an asymptotic equivalent of $G(t-s_t(b),x_t(c))$ following closely the computation of the
proof of Proposition 3.5. Here are the main steps of the computation: we have

G(t - s t (b),x t (c)) = W(t - s t (b)) \1 - J ±-^(1 - p{x))^-\6dx + 8 t _ st{h) (dx)) 



e 

rs_- — ■ 



-as t (b) 4»/{e-a) (ay-l+{-X(c)} ft-s t {b) -Ox 

*w /. ^p-^» r - w, - 1 (« fc + 



It is easy to deduce from Lemma 13.21 (ii) that the contribution of the Dirac mass and of the integral 
on the interval [0, (1 — e)logt/(9 — a)] for any fixed e E (0, 1) are both o(e~ ast ^). Using the fact 
that Wq(x) —> 9/ip(9) when x —> +oo and the change of variable y = B(\x t (c)] — l)e~^~ a ' x , the 
contribution of the integral on the interval [(1 — e) \ogt/{9 — a), blogt] is asymptotically equivalent 
to 



^(a)(9-a) \B(\x t (c)} - 1) J 

,B(Mc)l-l) *-(!-) / / /log Z(\xt(c)]-1) 

/ y a '^ exp (\x t (c)] - 1) log 1 - pi — * | j | dy. 
As in the proof of Proposition 3.5, Lebesgue's dominated convergence theorem then yields
\[ G(t-s_t(b),x_t(c)) \sim A(\theta)\,\varphi(\theta)^{c-1+\{-x_t(c)\}}\,\mathbb{P}(H > t \mid H > t-s_t(b)) \tag{4.7} \]
as $t \to +\infty$, where we used the fact that $x_t(c)\,t^{-b(\theta-\alpha)} \to 0$ when $t \to +\infty$ since $b > 1/(\theta-\alpha)$.
Now, it only remains to check that
\[ G(t-s_t(b),x_t(c)) \sim F(t-s_t(b),x_t(c)) \tag{4.8} \]
when $t \to +\infty$. Since for all $t, x \ge 0$
\[ 0 \le G(t,x) - F(t,x) = \mathbb{E}_t(L_t(x)) - \mathbb{P}_t(L_t(x) \ge 1) \le \mathbb{E}_t[L_t(x)(L_t(x) - 1)], \]
it is sufficient to prove that
\[ \mathbb{E}_t\big[L_{t-s_t(b)}(x_t(c))\,(L_{t-s_t(b)}(x_t(c)) - 1)\big] = o(G(t-s_t(b),x_t(c))) = o(e^{-\alpha s_t(b)}). \tag{4.9} \]

Taking s\ = and S2 = t — st(b) in Lemma 14.11 (|4.ip yields 

E* [L t _ at(b) (x t (c)) (^_ st(b) (x t (c)) - 1)] < 2E t K t _ St(b) (x t (c))(l + E t N t _ St{b) )x 
~b(l + 8(t-s t (b))) / 1 



a 



( 1 " -*(&)) J + m ^-^K^ St{h) (x t {c)/2) 



where we used the notation K s (x) := K s (x,Q,s). Using the inequality E t K s (x) < E t L s (x) = G(s,x) 
and the estimates 

1 a(t-s t {b)) 

E ' jV «-<» = kh >t- Stm = ^ - *™ ~ ~naT- < 4 ' 10 > 

and 

G(t - S t (b),X t (c)/2) ~ 2 e /( 9 - Q ) A(0) e -a(«(b)-*/2) r W-a) ) 

which can be proved following the same computation as above, we finally obtain for t large enough 

[£t- St (fc)Mc)) (l- t _ st ( 6 )(st(c)) - l)] 

<CG(t- s t (b),x t (c)) (>(*-^ fc )) logte- at ^/( e - Q ) + e M^W) e -a(»t(6)-t/2) r */a(*-art ( 

which entails (14. 9h and concludes the proof of Theorem 14.21 □ 
4.3 Joint convergence in law of the sizes of the most abundant families

The general idea of the previous argument can be summarized as follows: we construct a random
number $N_{t,s_t}$ of i.i.d. random variables $X_1, \ldots, X_{N_{t,s_t}}$ (the sizes of the most frequent haplotype
in each sub-tree) such that $X_t^{(1)} = \max\{X_1, \ldots, X_{N_{t,s_t}}\}$ with high probability. Our previous result
then corresponds to a classical argument of extreme value theory, which is known to extend
easily to compute the extreme statistics and the joint law of the largest random variables among
$X_1, \ldots, X_N$. The object of this subsection is to prove that this argument is valid in our situation.



Theorem 4.3 Assume $\alpha < \theta$ and recall the definition of $x_t(c)$ in (4.4). For all $n \in \mathbb{N}$, $k_1, \ldots, k_n \in \mathbb{Z}_+$
and $c_1, \ldots, c_n$ such that $c_i > c_{i+1} + 1$ for all $i \in \{1, \ldots, n-1\}$, we have, as $t \to +\infty$,
\[ \mathbb{P}_t\big(L_t(x_t(c_1)) = k_1,\ L_t(x_t(c_2)) - L_t(x_t(c_1)) = k_2,\ \ldots,\ L_t(x_t(c_n)) - L_t(x_t(c_{n-1})) = k_n\big) \]
\[ \sim \binom{k_1 + \ldots + k_n}{k_1, \ldots, k_n}\,\frac{\Gamma_t(c_1)^{k_1}\,(\Gamma_t(c_2) - \Gamma_t(c_1))^{k_2} \cdots (\Gamma_t(c_n) - \Gamma_t(c_{n-1}))^{k_n}}{(1 + \Gamma_t(c_n))^{k_1 + \ldots + k_n + 1}}, \tag{4.11} \]
with
\[ \Gamma_t(c) := A(\theta)\,\varphi(\theta)^{c-1+\{-x_t(c)\}} \]
for all $c \in \mathbb{R}$, where the constant $A(\theta)$ is defined in (3.8).



Proof. Let us denote by A(t; ci, . . . , c n ; k\, . . . , k n ) the event in the probability in the l.h.s. of (|4.1ip . 
Using the notation of the proof of Theorem l4.2l for all b > 1/(9— a), we define B(t,b; c\, . . . , c n ; k\, . . . , k n ) 
the event that among the sub-trees 71, . . . ,T/v tjS( (fc), there are exactly ki haplotypes carried by a 
number of living individuals at time t belonging to [xt(ci+i), xt(ci)) for all < i < n — 1, with the 
convention co = +oo. Then 

¥ t (A(t;ci,. ..,Cn]ki,... ,k n )) - F t (B(t,b; ci, . . . ,c n ; fei, . . . ,k n ) 

<¥ t [M t (x t {c n ),t-s t (b)) > 1] =F(t,x t (c n ),t-s t (b)) = o(l). 

Now, for all fixed t > 0, using the notation X\, . . . , X^ ttSt introduced above, we define for all a G M 

St(a) = #{i < N t>st : X % > a}. 

Defining 

C{t, b] c\ , . . . , c n ; k\ , . . . , kn) := 

|s' i (x t (ci)) = kt, St{x t {c 2 )) - 5 t (zt(ci)) = k 2 , ■ ■ ■ , S t (x t (c n )) - ^^(cn-i)) = k n \, 
we have



F t (B(t,b; ci, . . . ,c n ; fa,..., k n )) - F t (C(t,b; ci, . . . , c n ; fa, . . . ,k n )) 

< F t ^3i < N ti8t Q^ : % contains at least 2 haplotypes carried by more than xt{c n ) individuals 

<J2nH>t\H >t-s t (b))F(H <t\H > t- stib))*- 1 fcE t [L t _ St(6) (x 4 ( Cn ))(L t _ St{b) (x t (c n )) - 1) 
k>l 

Ef [L t _ St(b) (x t {c n )){L t _ St{b) (x t (c n )) - 1)] 
F(H >t\H>t- s t (b)) 

where we used the fact that N t ^ St ^ has geometric distribution with parameter F(H > t \ H > 
t- s t (b)). Equations (gSJ) and (@79]) then yield 

Pt(A(i; ci, . . . , Cn; fa, . . . , k n )) - F t (C(t, b; a, . . . , Cn 4 , fa, ■ ■ ■ , K) = o(l). 

Next, F t (C(t, b; c±, . . . , c n ; fa, . . . , k n )) can be computed using standard extreme value techniques: 
conditioning on A^ )St (M and considering all the possible ways to realize this event, we have 

F t (C(t,b; ci, . . . ,c n ; fa,. . . ,k n )) 

F(H >t\ H >t-s t (b))F{H <t\ H yt-stib))*- 1 ( k 

W t \Xi, ••• ,X kl > x t (ci) > X kl +i, ■ ■ .,X kl+k2 > x t (c 2 ) > ... > x t (c n ) > X kl+ .„ +kn+1 , ... ,X k 

- e n» > « i * > « - -mm* f^>'- CC:::,t") (*, + * + 

F(t- St (6),x t ( Cl )) fcl [F(t- St (6),x t (c 2 ))-F(t- S4 (6),x t ( Cl ))] fc2 x ... 

... x [F(t - s t (b),x t (cn)) - F(t - s t (b),x t (c n -i))] kn [1 - F(t - s t (b) , x t (cO)]*-* 1 --*" . 



The equation
\[ \sum_{k \ge m}\binom{k}{m}\,x^{k-m} = \frac{1}{(1-x)^{m+1}}, \qquad \forall x \in (-1,1),\ m \in \mathbb{N}, \]
yields



\(C(t,b; ci, . . . , c n ; fci, . . . , k n )) 

F(H >t\H >t- s t (b))F(H <t \ H >t- St(fr)) fcl +-+ fc "- 1 
" [F(H >t\H>t- s t (b)) + F(t - s t (b),x t (c n ))F(H <t\H>t- s t (b))] kl+ - +kn+1 * 

\ + " ' t S " s*(b), ^(ci))" 1 [^(t - «t(6), x t (c 2 )) - F(t - st(b),x t ( Cl ))] k > x . . . 

K\ , . . . , K n / 

... x [F(t-s t (b),x t (c n ))-F(t-s t (b),x t (c n ^))} k \ 



and Theorem [4731 then follows from (jITSI) . (14771) and (jiTSI) . 



□ 
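The binomial series identity $\sum_{k\ge m}\binom{k}{m}x^{k-m} = (1-x)^{-(m+1)}$ used in the proof above, which turns the sum over the geometric number of sub-trees into a closed form, can be checked numerically. This is only an illustrative sketch; `partial_sum` is a hypothetical helper that truncates the series.

```python
from math import comb

def partial_sum(x, m, terms=3000):
    """Truncation of sum_{k >= m} C(k, m) x^(k - m)."""
    return sum(comb(k, m) * x ** (k - m) for k in range(m, m + terms))

for x, m in [(0.5, 0), (0.5, 3), (-0.7, 2), (0.9, 4)]:
    # the two printed values should agree up to rounding
    print(x, m, partial_sum(x, m), 1.0 / (1.0 - x) ** (m + 1))
```

The identity is the $m$-th derivative of the geometric series divided by $m!$, so it holds for every $x$ in $(-1,1)$.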
In order to state our next result, we define the number $t_n$ by the equation $x_{t_n}(0) = n$. In view
of (4.4), this equation has a unique solution $t_n$ if $n$ is large enough, say larger than $n_0$. In addition,



and hence t n —> +00 as n — > +00. 

We also recall the notation $\mathcal{M}(\mathbb{R})$ for the set of nonnegative $\sigma$-finite measures on $\mathbb{R}$, finite on $\mathbb{R}_+$,
and the definition of the semi-vague topology as the one induced by all maps of the form
\[ \nu \mapsto \int_{\mathbb{R}} u(x)\,\nu(dx) \]
for all continuous bounded functions $u$ on $\mathbb{R}$ such that there exists $x_0 \in \mathbb{R}$ such that $u(x) = 0$ for
all $x \le x_0$. Note that this topology is stronger than the usual vague topology, but weaker than the
usual weak topology.

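Corollary 4.4 Assume $\alpha < \theta$. Then, the sequence of point processes $(Z_n)_{n \ge n_0}$ on $\mathbb{Z}$, defined by
\[ Z_n := \sum_{k \ge 1}\delta_{X_{t_n}^{(k)} - n}, \]
converges as $n \to +\infty$ in $\mathbb{P}^*$-distribution on the set $\mathcal{M}(\mathbb{R})$ equipped with the semi-vague topology to a
mixed Poisson point measure on $\mathbb{Z}$ with intensity measure
\[ \xi\,A(\theta)\,(1 - \varphi(\theta))\sum_{x \in \mathbb{Z}}\varphi(\theta)^{x-1}\,\delta_x, \]
where the mixture coefficient $\xi$ has exponential distribution with parameter 1.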

The proof of such results is quite standard (cf. [11] in the general context of point processes
and [13] more specifically on extreme values). However, we shall give a proof for the sake of completeness
and because of the specificity of the semi-vague topology.

Note that one can also easily obtain the convergence, in the sense of finite-dimensional distribu- 
tions, of any finite sequence of translated extreme family sizes towards the corresponding sequence of 
extreme points of the limit point process in the previous result. We shall not prove this, but instead 
we refer to [J3] for the proof of similar standard results. 

Proof. Let us first prove the convergence in distribution when $\mathcal{M}(\mathbb{R})$ is equipped with the vague
topology. This amounts to proving the joint convergence in distribution of the random variables
$L_{t_n}(n+i) - L_{t_n}(n+i+1) = L_{t_n}(x_{t_n}(i)) - L_{t_n}(x_{t_n}(i+1))$, giving the number of haplotypes represented
by exactly $n+i$ individuals at time $t_n$, for $b \le i \le a$, for all $b < a$ in $\mathbb{Z}$. Fix $b < a$ in $\mathbb{Z}$ and $k_b, \ldots, k_a$


log¥>(fl)|(fl-q) 

e 



n, 
in $\mathbb{Z}_+$. On the one hand, we claim that



lim F tn (L tn (n + i) - L tn (n + * + 1) = k h V6 < i < a) 

n— >+oo 



lim ^P^CM^CO) - L tn (x tn (i + l)) = ki,Vb<i<a and L* n (x tn (a + 1)) = k) 
fca + ... + fc 6 + fcUt>+ l) fe K(a) - r t „(a + l)) fc ° . . . (r tn (b) - r tn (b + l)) k » 



n— »+oo 

fc>0 



E 



k a ,...,k b ,k J (1 + rt (6)) fc +*»+-+fc»+ 1 



fc>0 

fc a + . . . + fc 6 \ [A(l - p)]*a+-+*% ¥ ,(o-l)fc„+...+(6-l)* i 



k a ,...,k b J [1 + A{ip b ~ l - cpa^ka+.-.+kb+l 
where we used the fact that, for all x G Z and n > uq, 



T tn (x) = Atp*' 1 with A = A(0) and y> = y>(0) := 1 ^ 

This is an immediate consequence of Theorem 14, 3^ provided we can justify the exchange of the sum 
over k and the limit n — > +oo, i.e. that we can control the remainder of the series uniformly over 
n>riQ. The following inequality, making use of Proposition 13.51 solves this question: for all iV G N, 

Pt„0Mz tn (*)) " LtnMi + 1)) = h, V6 < i < a and L tn (x tn (a + 1)) > N) 

< P*„0M^> + 1)) > N) < Et " Lt " ( ^ (a + 1)) < ± supE t L t (xt(a + !))<£ 

for some constant C > 0. 

On the other hand, assume that £ is an exponential random variable with parameter 1, and 
(Px)xez is a sequence of r.v. with P x distributed as a mixed Poisson with parameter £A{1 — ip)(p x ~ x 
and such that the r.v. (P x )xeZ are independent conditionally on £. Then, 



poo 

P(p. = jfe., V6 < i < a) = / dxe~ x e " A ( 1 -* , ) I ^« ,B " 

Jo 



(xA(l — ip)) k °-+-+ k b y,(a-l)feo+-+(6-l)*6 



o fcj • • • 



L4(l — ip\] k a+-+ k b ( p(a-l)k a +...+(b-l)k b 

kj... k b \ [1 + A(l - if) Yl =b ^-l]fea + - + * 6 + l JQ 

fc a + . . . + A; 6 \ [A(l - (/7 (a-l)fc a +...+(6-l)fc ! , 



yk a +...+k be y^y 



6-1 _ , / ,a N ilA: a +...+A: f ,+l 



k a ,...,k b J [1 + A^- 1 - <p a )) 

where we used the change of variable y = x(l + A(l — (p) J2m=b l P m ~ 1 )- Observing that P* n converges 
to P* for the total variation norm, this ends the proof of Corollary 14.41 when J\A(M) is equipped with 
the vague topology. 
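The computation above rests on the fact that a Poisson variable whose parameter is $\xi\lambda$, with $\xi$ exponential of parameter 1, has the geometric-type law $\lambda^k/(1+\lambda)^{k+1}$. The following sketch checks this by numerical integration; the function names and the value of `lam` are hypothetical, and the trapezoid rule on a truncated interval is just one convenient way to approximate the integral.

```python
import math

def mixed_poisson_pmf(k, lam, upper=80.0, n=80_000):
    """P(P = k) when P | xi ~ Poisson(xi * lam) and xi ~ Exp(1), by the trapezoid rule on [0, upper]."""
    h = upper / n
    def f(x):
        return math.exp(-x * (1.0 + lam)) * (x * lam) ** k / math.factorial(k)
    return h * (0.5 * f(0.0) + sum(f(i * h) for i in range(1, n)) + 0.5 * f(upper))

def geometric_pmf(k, lam):
    """Closed form lam^k / (1 + lam)^(k + 1)."""
    return lam ** k / (1.0 + lam) ** (k + 1)

lam = 0.7
for k in range(5):
    print(k, mixed_poisson_pmf(k, lam), geometric_pmf(k, lam))
```

For $k = 0$ this gives $1/(1+\lambda)$, the same shape as the limiting probability that no family exceeds the threshold $x_t(c)$ in Theorem 4.2.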

To complete the proof of Corollary 14.41 since all the point measures Z n have support in Z, we 
need to check that, for any continuous bounded function / on R and any sequence (u(k))kez such 
that u(k) = for all k < ko for some ko 6 Z, 

JunE tn f([ u{x)Z n {dx) \ = E/( £ u{k)P k \. 

' \k>k ) 
Note first that the sum in the r.h.s. is almost surely finite since, conditional on $\xi$, this is a sum of
independent r.v. with only finitely many of them being non-zero by the Borel-Cantelli lemma.
Next, fix $\varepsilon > 0$ and let $a$ and $T$ be large enough so that



and, by Theorem 
Then, for all n such that t n >T 



l + r tn (a) 1 + Atp 
(i) 



sup F t [Xl 1 ' > x t (a)) < 2e. 

t>T 



E *' l/ (i U(x)Zn(<ix) ) ~ Ein/ ( ^ u(k){L tn (x tn (k))-L tn (x tn (k + l)))^ 



< 4e 



The first step of the proof then yields 



E tn f n u (x)Z n (dx)j -Ef( u(k)P k )\ 



<(4||/||oo + l)e 



for all n large enough. Since 



P(3fc > a : P k > 1) < Y, ¥ (Pk > 1) = J> " ^ ) < Aj^v"' 1 < J 

k>a k>a 

we finally obtain 



k>a 



Etn/ (X U(X)Zn(dX) ) ~ E/ ( ^ 

which ends the proof of Corollary 14.41 



<(4||/|| 00 + l + (l- V p)- 1 ) £ , 



□ 



4.4 Convergence in distribution of the ages of oldest families

The previous method can easily be extended to prove the convergence of the ages of the oldest
families. Let us recall the notation $A_t^{(1)} \ge A_t^{(2)} \ge \ldots \ge A_t^{(k)} \ge \ldots$ for the ordered sequence of ages
of all alive families at time $t$ (with the convention that $A_t^{(k)} = 0$ when $k$ is larger than the number
of alive families at time $t$).



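Theorem 4.5 Assume $\alpha < \theta$ and define for all $a \in \mathbb{R}$
\[ x_t(a) := \frac{\alpha t}{\theta} + a. \]
For all $n \in \mathbb{N}$, $k_1, \ldots, k_n \in \mathbb{Z}_+$ and $a_1 > a_2 > \ldots > a_n$, we have
\[ \lim_{t\to+\infty}\mathbb{P}_t\big(O_t(x_t(a_1)) = k_1,\ O_t(x_t(a_2)) - O_t(x_t(a_1)) = k_2,\ \ldots,\ O_t(x_t(a_n)) - O_t(x_t(a_{n-1})) = k_n\big) \]
\[ = \frac{\theta\psi'(\alpha)}{\psi(\theta)}\binom{k_1 + \ldots + k_n}{k_1, \ldots, k_n}\frac{e^{-\theta k_1 a_1}\,(e^{-\theta a_2} - e^{-\theta a_1})^{k_2} \cdots (e^{-\theta a_n} - e^{-\theta a_{n-1}})^{k_n}}{\left[\frac{\theta\psi'(\alpha)}{\psi(\theta)} + e^{-\theta a_n}\right]^{k_1 + \ldots + k_n + 1}}. \tag{4.12} \]
In addition, the family of $\mathcal{M}(\mathbb{R})$-valued random variables $(Z_t, t \ge 0)$, defined for all $t \ge 0$ by
\[ Z_t := \sum_{k \ge 1}\delta_{A_t^{(k)} - \frac{\alpha t}{\theta}}, \]
converges as $t \to +\infty$ in $\mathbb{P}^*$-distribution in $\mathcal{M}(\mathbb{R})$ equipped with the semi-vague topology to a mixed
Poisson point measure on $\mathbb{R}$ with intensity measure
\[ \xi\,\frac{\psi(\theta)}{\psi'(\alpha)}\,e^{-\theta a}\,da, \tag{4.13} \]
where the mixture coefficient $\xi$ has exponential distribution with parameter 1.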

The proof follows closely the lines of those of Theorems 4.2 and 4.3 and Corollary 4.4. As a first
step, we prove the following lemma.

Lemma 4.6 With the same notation as in Theorem 4.5, we have for all $a \in \mathbb{R}$
\[ \lim_{t\to+\infty}\mathbb{P}_t\left(A_t^{(1)} < \frac{\alpha t}{\theta} + a\right) = \frac{1}{1 + \frac{\psi(\theta)}{\theta\psi'(\alpha)}\,e^{-\theta a}}. \]

Proof. The proof of this lemma is similar to the one of Theorem 4.2. Defining for all $t, x \ge 0$
\[ F(t,x) = \mathbb{P}_t(A_t^{(1)} \ge x) = \mathbb{P}_t[O_t(x) \ge 1] \quad\text{and}\quad G(t,x) = \mathbb{E}_t[O_t(x)], \]
we have for all $x < s < t$
\[ 0 \le F(t,x) - \mathbb{P}_t\big(\exists i \in \{1,\ldots,N_{t,s}\} : \mathcal{T}_i \text{ contains at time } t \text{ a haplotype older than } x\big) \]
\[ = F(t,x) - \frac{1}{1 + \mathbb{P}(H > t \mid H > t-s)\left(\frac{1}{F(t-s,x)} - 1\right)} \le F(t,t-s) \le G(t,t-s). \tag{4.14} \]
Defining
\[ s_t(b) = bt \quad\text{with } b \in (0, 1-\alpha/\theta), \]
following the proof of Proposition 3.4, one easily checks that $G(t,t-s_t(b)) = o(1)$ and
\[ G(t-s_t(b),x_t(a)) \sim \frac{\psi(\theta)}{\theta\psi'(\alpha)}\,e^{-\theta a - \alpha s_t(b)}, \tag{4.15} \]
where we stick to the notation $x_t(a) = \frac{\alpha t}{\theta} + a$. Then Lemma 4.6 follows from (4.6) and the fact that
\[ G(t-s_t(b),x_t(a)) \sim F(t-s_t(b),x_t(a)) \]
as $t \to \infty$. To prove this last equation, it is sufficient to prove that
\[ \mathbb{E}_t\big[O_{t-s_t(b)}(x_t(a))\,(O_{t-s_t(b)}(x_t(a)) - 1)\big] = o(e^{-\alpha s_t(b)}), \tag{4.16} \]
as in (4.9).
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Now, we observe that 

i _ i 1 

F(H > s\H <t) = S3 pfi. „ „ ^'( a ) e - QS when s, t -¥ +00 with t - s -»• +00 (4.17) 

1 ~ W7JJ 

and, by Lemma 3.2 (ii), for a constant $C$ that may change from line to line,

, r (Q) fl „-fe - f*W) - ^(^(q) - z) 
+ Jo W e (t-s t (b)) 

fx t (a) 

= e -tet(«) + / e - fe ^(x t (a) - z)^(0)[p(x t (a) - z) - p(t - s t (b))]dz 

x t (a) 

e ~ dz e -i e - a )( x t{a)-z) ^ z 







< e -tet(a) + C I 
JO 

Combining these two facts with (4.10) and Lemma 4.1 (4.1), in which we take $x = 1$, $s_1 = x_t(a)$ and
$s_2 = t - s_t(b)$, we have

E* [O t -« ( 6)(a:t(a)) (O t _ st(6) (ar t (a)) - l)] < CE t tf t _ St(6) (M t (a),i- St (&)) (l + e~ ax ^ e a ^ s ^) x 
"(1 + t)e -(e-«)Ma) + e -«*(«) e a(t- s *(6)) Et ii: t _ st{6) (1, x t (a), t - St (6))] . (4.18) 

Here, in contrast with the proof of Theorem 4.2, the bound $K_s(1,x,s) \le O_s(x)$ is not sufficient to
obtain the desired result. Instead, we use Lemma 4.1 (4.2):

%K t _ St( p-)(l,x t (a),t - s t {b)) 

l ft-s t (b) / rv \ 

<-J [e- 9y + ^ ee- ez W e (y - z)ip(0)[p(y - z) - p(y)]dzj (6dy + 5 t _ St{b) (dy)). 

Hence, by Lemma 3.2 (ii) again,

E t K t _ St(b) (l,x t (a),t-s t (b)) < C / + / e-^^e^'dz ) {6dy + S t _ at{b) (dy)) 

Jx t {a) V JO J 

ft-s t (b) 

<C / e-^y(edy + 5 t _ st{b) (dy)) 

J xt(a) 

Together with (4.18), this yields

Et [Ot-s t (b)(.x t (a)) [O t _ st{b) (x t (a)) - l)] < C(a) e - as ^ + e -««) = ( e — *(*)), 

where the constant $C(a)$ depends on $a$ but not on $t$. This concludes the proof of Lemma 4.6. □
Proof of Theorem 4.5. Equation (4.12) can be deduced from Lemma 4.6 exactly as Theorem 4.3
was deduced from Theorem 4.2. We leave the details to the reader.

In view of (|4.12p , the computation in the proof of Corollary 14.41 immediately proves (replacing 



Tt n {a) with jjtt-t e aa ) that, for all a\ > a 2 > . . . > a n , the random vector 



O t (x t (a 2 )) - O t {x t (ai)), O t (x t (a n )) - O i (x t (a n _i)) 

converges in distribution as $t \to +\infty$ to a vector whose coordinates are independent conditionally on
$\xi$ and have mixed Poisson distributions with mixture coefficient $\xi$ and parameters

0tp'{a) dtp' (a) 

It is then standard to deduce the convergence in distribution of $Z_t$ to P on $\mathcal{M}(\mathbb{R})$ equipped with
the vague topology (cf. e.g. [H] Thm. 4.7]). The semi-vague topology can then be handled
as in the proof of Corollary 4.4. Again, we leave the details to the reader. □



5 Large or old families: convergence in distribution for critical 
clonal families 

The method that we used in the previous section can also be applied to the case where $\alpha = \theta$. All
the proofs are similar, and we only give details where the proofs differ. We use the
same notation as in the previous section.

5.1 Frequent haplotypes 

Theorem 5.1 Assume $\alpha = \theta$. For all $c \in \mathbb{R}$, let
\[
x_t(c) = \frac{\alpha^2}{4\psi'(\alpha)} \left( t - \frac{\log t}{2\alpha} + c \right)^2.
\]
For all $n \in \mathbb{N}$, $k_1, \ldots, k_n \in \mathbb{Z}_+$ and $c_1 > c_2 > \ldots > c_n$, we have



lim F t 



L t (x t {ci)) = fei, L t (x t (c 2 )) - L t (x t (ci)) = k 2 , ■ ■ ■ , L t (x t (c n )) - L t (x t (c n -i)) = k n 
~a e _ B+ tlSpl fh + . . . + k n \ e _a * ici (e- aca - e ~ aci ) k2 . . . (e~ aCn - e - ac «-i^ 




2vr V h,...,k n J ( B ,v>'(«) x fci+...+fc„+i 



where the constant $B$ is defined in (3.13). In addition, the family of $\mathcal{M}(\mathbb{R})$-valued random variables
$(Z_t, t \ge 0)$, defined for all $t \ge 0$ by

k>l V f 2y r i?l^) y - 2a I 
converges as $t \to +\infty$ in $\mathbb{P}^*$-distribution in $\mathcal{M}(\mathbb{R})$ equipped with the semi-vague topology to a mixed
Poisson point measure on $\mathbb{R}$ with intensity measure

£V2^e B -^e~ ac dc, (5.1) 
where the mixture coefficient $\xi$ has exponential distribution with parameter 1.

The proof of this result is exactly the same as for Theorems 4.3 and 4.5, provided we can prove the
following lemma.

Lemma 5.2 With the same notation as in Theorem 5.1, for all $c \in \mathbb{R}$,

lim P t (X t (1) < x t {c)) 



Proof. The proof of this result is similar to that of Theorem 4.2. Fix $\varepsilon > 0$. We first observe
that Proposition 3.7 implies that



t 



M t (x t (c),0,^—^t) > 1 J <E t M t (x t (c),0,^—^t) =o(l), 



and thus 

P 4 (X t (1) > x t {c)) = F t [M t (x t (c), i^t) > lj + o(l), 

so that it is enough for us to study $\mathbb{P}_t(M_t(x_t(c), s_t^{(1)}) \ge 1)$, where we put
\[
s_t^{(1)} := \frac{1-\varepsilon}{2}\, t.
\]


Defining 



F(t,x) =P t [M t (x, S J 1} ) > 1], G(t,x)=E t [M t (x,s ( t 1) )} 



and 

F(t,a:,s) =F t [M t (x,s) > 1], 

we can make the same computation as in the proof of Theorem 4.2 to show that (4.5) holds true if
$t - s \ge s_t^{(1)}$. So let us define
\[
s_t(b) = bt, \quad\text{where } b \in (0, 1/2).
\]
By Proposition 3.7, we immediately have $F(t, x_t(c), t - s_t(b)) = o(1)$. We observe that

/t—st(b) „-ax 
^^ e (lMc)]-i)io S (i-i/w a{x) ) {adx + St _ st(b) (dx)). 
(i) W a (x) 

Using the inequality $\log(1 - x) \le -x$, Lemma 3.2 (iii) entails that the contribution of the Dirac mass
is

O (I e -*t(c)/W„(*-*(6)) 
Fix $\eta \in (0, 1)$. Using the expression of $x_t(c)$ and the fact that $1/W_\alpha(t) \ge (1 - \eta)\psi'(\alpha)/(\alpha t)$ for $t$
large enough, this last quantity is



where the last equality is valid if one chooses $\eta < (1 - 2b)^2$.
Hence

ft—st(b) „-ax , . 

G(t-S t (b),X t (c)) = W(t-S t (b)) / Ji^ e (\Mc)]-l)log(l-l/W a (x)) adx + a f e -as t (b) \ _ 

j s w w a (x) v y 

Now, the integral in the r.h.s. is exactly the same as in (3.14), except for the interval of integration.
We actually proved in the proof of Proposition 3.7 that, since $(1 - b) > 1/2$, this integral is equivalent
to



l+e 



t „—ax 



2 e 



2 



e (lMc)]~l)log(l-l/W a (x)) adx ^ 



1 , W a (x) 



which is itself equivalent to 



i /2tt B _ilM _ ac 
e 2 e . 



W(t) V a 
Therefore, 

G(t - s t (b),x t (c)) ~ e - QSt W x ^L e B ~^ e - ac , (5.2) 

V a 

and, recalling that (4.5) holds (with our current notation), the proof of Lemma 5.2 will be completed
if we can prove that
\[
G(t - s_t(b), x_t(c)) \sim F(t - s_t(b), x_t(c))
\]
as $t \to +\infty$. Again, this is implied by the estimate
\[
\mathbb{E}_t\big[M_{t-s_t(b)}(x_t(c), s_t^{(1)})\,(M_{t-s_t(b)}(x_t(c), s_t^{(1)}) - 1)\big] = o(e^{-\alpha s_t(b)}), \tag{5.3}
\]
which we now prove.

Applying Lemma 4.1 (4.1) with $x = x_t(c)$, $s_1 = s_t^{(1)}$ and $s_2 = t - s_t(b)$, and combining the result
with (4.17) and the fact that, for all $s < t$,

„- + r ^-wm-wu—)^ < e -«, + r _ lf 

we obtain 

E t [M t _. t(6) (o; t (c),« t (1) )(M t _. t(b) ( a ; t (c),^ 1) ) - 1)] 

<CW. t K t _ St{h) (x t (c\s?\t- s t (b)) (l + e—^e^W)) x 



(5.4) 
Fix again $\eta \in (0, 1)$. Using the inequality $\log(1 - x) \le -x$ and Lemma 3.2 (iii), we have for $t$ large
enough,

[a*(«01-i 

I — I < gexp | - , - ; | t A 



W a (t - s t (b)) 



or I 2 tlogt 



< C exp 



4^' (a) 
4(1-6) *' 



+ 2ct 



(l-^'(a) 



a(t - s*(6)) 



where we used the inequality $1/(1 - b) < 2$ to upper bound the exponent of $t$ in the last inequality.
Using Lemma 4.1 (4.3), this last inequality yields



5 ft-s t {b) 



a 7„(i) 

ail 



1 



1 



W a (y) 



1 



r**(c)i-i 

(ady + S (dy)) 

fa: t (c)l-l f -t-st(b) 



a\ W a {t-s t (b)), 



(i) 



[ady + 8 (dy)) 



Similarly, 



^K^ H ^{xt{c)/2,8^\t- 8t{b)) < - (1 



a V W a (t - s t (b)) 



\x t (c)/2\-l r t-s t (b) 



(1) 



(ady + <5 (dy)) 



< C exp 



"(1 ~ ??) 
" 8(1 - 6) 



t) t 5 / 4 . 



Combining the previous inequalities with (5.4), we finally obtain

E t [M^, t(6) (x t (c),4 1) )(M t _, t(6) (x t (c), a | 1) ) - 1)] 
< Ci 3 e - QSt(6) exp ( -at 
1 — 77 



1 — e I — 77 



exp —at 



4(1 - b) 



1 1 , 1 — £ 1—77 

+ cxp(-at[6+ — + 8^-! 



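Remember now that $\varepsilon$ and $\eta$ are free parameters in $(0, 1)$. We may assume that they are linked to $b$
by the equation
\[
\frac{1-\eta}{4(1-b)} = \frac{1-\varepsilon}{2},
\quad\text{or}\quad
b = \frac{1}{2} - \frac{\varepsilon-\eta}{2(1-\varepsilon)},
\]
which is always possible since $b < 1/2$. This yields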



E t [M^, t(6) (x t (c), S i 1) )(Af t _ St(6) (x t (c) jS ^) - 1)] 



exp —at 



1 - 3e 



+ exp —at — 



1 e - 77 7e 



4 2(1 -e) 4 



Taking $b$ close enough to $1/2$ allows us to take both $\varepsilon$ and $\eta$ as close to 0 as desired. Therefore, (5.3) is
proved and the proof of Lemma 5.2 is completed. □
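
The algebra linking the three parameters at the end of this proof is easy to check numerically. The sketch below takes the link between $\varepsilon$, $\eta$ and $b$ to be $(1-\eta)/(4(1-b)) = (1-\varepsilon)/2$ and solves it for $b$; the function names and the sample values are hypothetical and only serve as an illustration.

```python
def b_from(eps, eta):
    """Value of b solving (1 - eta) / (4 * (1 - b)) == (1 - eps) / 2."""
    return 0.5 - (eps - eta) / (2.0 * (1.0 - eps))

def link_residual(eps, eta):
    """Left-hand side minus right-hand side of the linking equation."""
    b = b_from(eps, eta)
    return (1.0 - eta) / (4.0 * (1.0 - b)) - (1.0 - eps) / 2.0

if __name__ == "__main__":
    # Illustrative values with eta < eps, so that b < 1/2.
    for eps, eta in [(0.1, 0.05), (0.01, 0.005), (0.001, 0.0005)]:
        print(eps, eta, b_from(eps, eta), link_residual(eps, eta))
```

As the printed values suggest, $b$ approaches $1/2$ when $\varepsilon$ and $\eta$ both tend to 0 with $\eta < \varepsilon$.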
5.2 Old haplotypes

Theorem 5.3 Assume $\alpha = \theta$ and define for all $a \in \mathbb{R}$
\[
x_t(a) = t - \frac{\log t}{\alpha} + a.
\]
For all $n \in \mathbb{N}$, $k_1, \ldots, k_n \in \mathbb{Z}_+$ and $a_1 > a_2 > \ldots > a_n$, we have
\[
\begin{aligned}
&\lim_{t\to+\infty} \mathbb{P}_t\Big[ O_t(x_t(a_1)) = k_1,\ O_t(x_t(a_2)) - O_t(x_t(a_1)) = k_2,\ \ldots,\ O_t(x_t(a_n)) - O_t(x_t(a_{n-1})) = k_n \Big] \\
&\qquad = \alpha \binom{k_1+\ldots+k_n}{k_1,\ldots,k_n}
\frac{e^{-\alpha k_1 a_1}\,(e^{-\alpha a_2} - e^{-\alpha a_1})^{k_2} \cdots (e^{-\alpha a_n} - e^{-\alpha a_{n-1}})^{k_n}}
{\left(\alpha + e^{-\alpha a_n}\right)^{k_1+\ldots+k_n+1}}.
\end{aligned}
\tag{5.5}
\]

In addition, the family of $\mathcal{M}(\mathbb{R})$-valued random variables $(Z_t, t \ge 0)$, defined for all $t \ge 0$ by
\[
Z_t := \sum_{k \ge 1} \delta_{A^{(k)}_t - t + \frac{\log t}{\alpha}},
\]
converges as $t \to +\infty$ in $\mathbb{P}^*$-distribution in $\mathcal{M}(\mathbb{R})$ equipped with the semi-vague topology to a mixed
Poisson point measure on $\mathbb{R}$ with intensity measure
\[
\xi\, e^{-\alpha a}\, da, \tag{5.6}
\]
where the mixture coefficient $\xi$ has exponential distribution with parameter 1.

Again, this result follows from the next lemma exactly as Theorem 4.5 followed from Lemma 4.6.

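Lemma 5.4 For all $a \in \mathbb{R}$,
\[
\lim_{t\to+\infty} \mathbb{P}_t\left( A^{(1)}_t < t - \frac{\log t}{\alpha} + a \right) = \frac{1}{1 + \frac{1}{\alpha}\, e^{-\alpha a}}.
\]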

Proof. We define $F(t,x)$ and $G(t,x)$ exactly as in the proof of Lemma 4.6 and we put
\[
s_t(b) = \frac{b}{\alpha} \log t \quad\text{with } b \in (0, 1).
\]
With this new notation, (4.14) holds true and Proposition 3.6 implies that $G(t, t - s_t(b)) = o(1)$. In
addition, one checks exactly as in the proof of Proposition 3.6 that

„-aa 

G(t-s t (b),x t (a)) e- as *V 

a 

when $t \to +\infty$. The proof will then be completed if we can prove (4.16). We first observe that
Wa(x) = e~ ax W (x) is bounded thanks to (|2.3p . Therefore, by Lemma 1331 (iii). 

„-«*(«). r {a) .t.--«« Wa{t - st{b)) - Wa{xt{a) - z ± jz 

1 ' W a (t- St (b)) 

<c (*-<*+ r l '\~*-«®-*{ a)+ *d,) 



o t - s t (b) 



t 






Combining this inequality with (4.17) and Lemma 4.1 (4.1), in which we take $x = 1$, $s_1 = x_t(a)$ and
$s_2 = t - s_t(b)$, yields

Et [O t _„(6)(ar t (a)) (O t - st(6) (x t (a)) - l)] < CE^_ Sf(6) (l,x t (a),t - *(&)) (l + e «(*-*W-**(«))) x 

logt 



(1 + logi)-5- + e^-tW-^WJEt^^Cl.xtCa),* - ^(6)) 



(5.7) 



By Lemma 4.1 (4.2), we have

l rt-st(b) / fy W (u) — W in — z) \ 

E t K t _ St{b) (l,x t (a),t-s(b)) < - / (ady+«5 t _ St(b) (dy)) e"^ + / ae'" ^ °, L dz) 

a A t (a) V Jo W a {y) J 

Using again the fact that $W'_\alpha(x)$ is bounded and that $1/W_\alpha(y) \le C/y$ for all $y$ large enough, we
deduce that

E t K t _ St(b) (l,x t (a),t- s t (b)) < / (ady + 5 t _ St{b) {dy)) + - / ze~ az dz 

Jx t {a) \ V JO 

rt-s t (b) i 

< C / -(ady + 5 t _ Stl b){dy)) 

J x t {a) V 

t 

Therefore, it follows from (5.7) that

Et [O t _ St(6) (x t (a)) (O t _ St(6) (^(a)) - l)] < Ce^W log* + ^) = o(e~ as ^). 

This completes the proof of Lemma 5.4. □

A Proof of Lemma 4.1

Recall the notation introduced in Section 4.1.

As seen in the proof of Theorem 4.2, this result (and actually all the results of Sections 4 and 5)
are consequences of estimates of the form
\[
\mathbb{P}_t(M_t(x_t, s_t) \ge 1) \sim \mathbb{E}_t M_t(x_t, s_t)
\]
as $t \to +\infty$, for convenient choices of $x_t$ and $s_t$. We chose to prove this result using the inequality
\[
0 \le \mathbb{E}_t M_t(x_t, s_t) - \mathbb{P}_t(M_t(x_t, s_t) \ge 1) \le \mathbb{E}_t[M_t(x_t, s_t)(M_t(x_t, s_t) - 1)],
\]
i.e. proving that
\[
\mathbb{E}_t[M_t(x_t, s_t)(M_t(x_t, s_t) - 1)] = o(\mathbb{E}_t M_t(x_t, s_t)).
\]
Such results are obtained using Lemma 4.1, which is an immediate consequence of the following two
lemmas.
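
The two-sided inequality used here holds pointwise: for every nonnegative integer $m$, one has $0 \le m - 1_{\{m \ge 1\}} \le m(m-1)$, and taking expectations gives the bound. The sketch below illustrates this on a Poisson law, which is chosen purely as an example and is not the law of $M_t$; the helper names are hypothetical.

```python
import math

def poisson_pmf(lam, n_max=80):
    """Probabilities P(M = 0), ..., P(M = n_max) for M ~ Poisson(lam)."""
    pmf, term = [], math.exp(-lam)
    for n in range(n_max + 1):
        pmf.append(term)
        term *= lam / (n + 1)
    return pmf

def second_moment_sandwich(pmf):
    """Return (E[M] - P(M >= 1), E[M (M - 1)]) for an integer-valued law."""
    mean = sum(m * p for m, p in enumerate(pmf))
    p_positive = sum(p for m, p in enumerate(pmf) if m >= 1)
    factorial_moment = sum(m * (m - 1) * p for m, p in enumerate(pmf))
    return mean - p_positive, factorial_moment

if __name__ == "__main__":
    gap, factorial_moment = second_moment_sandwich(poisson_pmf(0.3))
    print(gap, factorial_moment)  # the first value lies between 0 and the second
```

When the mean is small, the factorial moment is of second order in the mean, which is why controlling it is enough to get the equivalence between the probability and the expectation.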
We need to define the random variable $K'_t(x, s_1, s_2)$ by slightly modifying the definition of
$K_t(x, s_1, s_2)$: introducing an independent random variable $H'$ distributed as $H_1$ conditional on
$\{H_1 < t\}$, $K'_t(x, s_1, s_2)$ is the number of haplotypes carried by more than $x$ individuals alive at
time $t$, whose last mutation occurred on branch 0, and is older than $s_1$ and younger than $s_2 \wedge H'$.

As a first step, we compute an upper bound of $\mathbb{E}_t[M_t(x, s_1, s_2)(M_t(x, s_1, s_2) - 1)]$ expressed in
terms of $K_t$ and $K'_t$.

Lemma A.1 For all $t > 0$, $x \ge 1$, $0 \le s_1 < s_2 \le +\infty$, we have
\[
\begin{aligned}
\mathbb{E}_t[M_t(x,s_1,s_2)(M_t(x,s_1,s_2) - 1)] \le{}& \mathbb{E}_t[K_t(x,s_1,s_2)(K_t(x,s_1,s_2) - 1)] \\
&+ (\mathbb{E}_t N_t)\,\mathbb{E}_t[K'_t(x,s_1,s_2)(K'_t(x,s_1,s_2) - 1)] \\
&+ 8(\mathbb{E}_t N_t)\,(\mathbb{E}_t K_t(\lceil x/2\rceil,s_1,s_2))\,(\mathbb{E}_t K'_t(x,s_1,s_2)) \\
&+ 8(\mathbb{E}_t N_t)^2\,(\mathbb{E}_t K'_t(\lceil x/2\rceil,s_1,s_2))\,(\mathbb{E}_t K'_t(x,s_1,s_2)).
\end{aligned}
\]

Proof. We let $M_i$ be the number of mutations on branch $i$ (this branch has length $H_i$), considering
only the mutations younger than $t$ when $i = 0$. For all $j \le M_i$, we define $\ell_{ij}$ as the duration elapsed
since the $j$-th oldest mutation on branch $i$, with $\ell_{i(M_i+1)} = 0$, $\ell_{i0} = H_i$ and $\ell_{00} = t$. We also define
$M'_0$ as the smallest $k \ge 1$ such that $\ell_{0k} \le H'$ (and $M'_0 = 0$ if there is no such $k \ge 1$).

For $0 \le j \le M_i$, denote by $R^{ij}_t$ the number of individuals alive at time $t$ descending clonally from
the time interval $I_{ij} := (t - \ell_{ij}, t - \ell_{i(j+1)})$ on branch $i$. More specifically, for a progenitor individual
alive on the time interval $(a, b)$ and experiencing no mutation between times $a$ and $b$, we refer to
'clonal descendants from the time interval $(a, b)$' as those individuals alive at $t$ (including possibly
the progenitor) descending from those daughters of the progenitor who were born during the time
interval $(a, b)$, and that still carry the same type the progenitor carried at time $a$. Using the notation
$A_{ij} = A_{ij}(t,x,s_1,s_2) := \{R^{ij}_t \ge x,\ \ell_{ij} \in [s_1,s_2)\}$, we have
\[
M_t(x,s_1,s_2) = \sum_{0 \le j \le M_0} 1_{A_{0j}} + \sum_{1 \le i < N_t}\ \sum_{1 \le j \le M_i} 1_{A_{ij}}
\]
and
\[
\begin{aligned}
\mathbb{E}_t M_t(x,s_1,s_2) &= \mathbb{E}_t K_t(x,s_1,s_2) + \sum_{i \ge 1}\sum_{j \ge 1} \mathbb{P}_t(A_{ij},\ i < N_t,\ j \le M_i) \\
&= \mathbb{E}_t K_t(x,s_1,s_2) + \sum_{i \ge 1}\sum_{j \ge 1} \mathbb{P}_t(A_{ij},\ j \le M_i \mid i < N_t)\,\mathbb{P}_t(i < N_t) \\
&= \mathbb{E}_t K_t(x,s_1,s_2) + \sum_{i \ge 1}\sum_{j \ge 1} \mathbb{P}_t(A_{0j},\ j \le M'_0)\,\mathbb{P}_t(i < N_t) \\
&= \mathbb{E}_t K_t(x,s_1,s_2) + (\mathbb{E}_t N_t - 1)\,\mathbb{E}_t K'_t(x,s_1,s_2).
\end{aligned}
\]






Now, 

Nt-1 

M t (x,S 1 ,S 2 )(M t (x,S 1 ,S 2 )-l) = 2 ^0^+2 E 

0<k<j<M i=l l<k<j<Mi 

M N t -1 Mi ^ M l Mi 

+ 2 EEEwu 3 + 2 EE^w 

k=0 i=l j=l l<l<i<N t k=l j=l 

Hence, using a similar computation as above, 

E t M t (x,s 1 ,s 2 )(M t (x,s 1 ,s 2 ) -1) 

= E t K t {x, Sl ,s 2 ){K t (x, Sl ,s 2 )-1) + 2J2 E E P *^o fc n A °i>i ^ <W < N t ) 

i>l k>l j>k 

+ 2 E E E Ft( - A M n ^3' k < M o,i< N t ,j < Mi) 

k>0 i>l j>l 

+ 2 E E E E Ft ( A ° k n A (i-i)i> k < M o,i-l<N t ,j < Mi-i)F t (l < N t ) 

1>1 k>l i>l j>l 

= E t K t {x, Sl ,s 2 )(K t (x, Sl ,s 2 ) - 1) + {E t N t - l)E t K' t (x, s u s 2 )(K' t (x, s u s 2 ) - 1) 
+ 2 E E E Ft( - Aok n ^3' k < M o,i< N tJ < Mi) 

k>0 i>l j>l 

+ 2(E t N t - 1) E E E Ft( - Aok n ^j' k <K^< N tJ < Mi). 

k>l i>l j>l 

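For short, we write
\[
\begin{aligned}
\mathbb{E}_t M_t(x,s_1,s_2)(M_t(x,s_1,s_2) - 1)
\le{}& \mathbb{E}_t K_t(x,s_1,s_2)(K_t(x,s_1,s_2) - 1) + (\mathbb{E}_t N_t)\,\mathbb{E}_t K'_t(x,s_1,s_2)(K'_t(x,s_1,s_2) - 1) \\
&+ 2 \sum_{k \ge 0}\sum_{i \ge 1}\sum_{j \ge 1} \mathbb{P}_t(A_{0k} \cap A_{ij} \cap B_{ijk})
+ 2(\mathbb{E}_t N_t) \sum_{k \ge 1}\sum_{i \ge 1}\sum_{j \ge 1} \mathbb{P}_t(A_{0k} \cap A_{ij} \cap B'_{ijk}),
\end{aligned}
\tag{A.1}
\]
where $B_{ijk} := \{k \le M_0,\ i < N_t,\ j \le M_i\}$ and $B'_{ijk} := \{k \le M'_0,\ i < N_t,\ j \le M_i\}$.
Now for any positive integers $i, k$, define the three following events
\[
\alpha_{ik} := \Big\{ \max_{1 \le j \le i} H_j \ge \ell_{0k} \Big\}, \quad
\beta_{ik} := \Big\{ \ell_{0(k+1)} \le \max_{1 \le j \le i} H_j < \ell_{0k} \Big\}, \quad
\gamma_{ik} := \Big\{ \max_{1 \le j \le i} H_j < \ell_{0(k+1)} \Big\}.
\]
We are going to state and prove six inequalities, where the left-hand side is obtained by intersecting
each event $A_{0k} \cap A_{ij} \cap B_{ijk}$ or $A_{0k} \cap A_{ij} \cap B'_{ijk}$ with each of the preceding ones $\alpha_{ik}$, $\beta_{ik}$, $\gamma_{ik}$, and
summing over $i, j \ge 1$ and $k \ge 0$ for the events involving $B_{ijk}$, and $k \ge 1$ for the events involving
$B'_{ijk}$:
\[
\sum_{i,j,k} \mathbb{P}_t(\alpha_{ik} \cap A_{0k} \cap A_{ij} \cap B_{ijk}) \le (\mathbb{E}_t N_t)(\mathbb{E}_t K_t(x,s_1,s_2))(\mathbb{E}_t K'_t(x,s_1,s_2)), \tag{A.2}
\]
\[
\sum_{i,j,k} \mathbb{P}_t(\alpha_{ik} \cap A_{0k} \cap A_{ij} \cap B'_{ijk}) \le (\mathbb{E}_t N_t)(\mathbb{E}_t K'_t(x,s_1,s_2))^2, \tag{A.3}
\]
\[
\sum_{i,j,k} \mathbb{P}_t(\gamma_{ik} \cap A_{0k} \cap A_{ij} \cap B_{ijk}) \le (\mathbb{E}_t N_t)(\mathbb{E}_t K_t(x,s_1,s_2))(\mathbb{E}_t K'_t(x,s_1,s_2)), \tag{A.4}
\]
\[
\sum_{i,j,k} \mathbb{P}_t(\gamma_{ik} \cap A_{0k} \cap A_{ij} \cap B'_{ijk}) \le (\mathbb{E}_t N_t)(\mathbb{E}_t K'_t(x,s_1,s_2))^2, \tag{A.5}
\]
\[
\sum_{i,j,k} \mathbb{P}_t(\beta_{ik} \cap A_{0k} \cap A_{ij} \cap B_{ijk}) \le 2(\mathbb{E}_t N_t)(\mathbb{E}_t K_t(\lceil x/2\rceil,s_1,s_2))(\mathbb{E}_t K'_t(x,s_1,s_2)), \tag{A.6}
\]
\[
\sum_{i,j,k} \mathbb{P}_t(\beta_{ik} \cap A_{0k} \cap A_{ij} \cap B'_{ijk}) \le 2(\mathbb{E}_t N_t)(\mathbb{E}_t K'_t(\lceil x/2\rceil,s_1,s_2))(\mathbb{E}_t K'_t(x,s_1,s_2)). \tag{A.7}
\]
Combining these six inequalities with (A.1) and with the inequalities $K_t(x, s_1, s_2) \le K_t(\lceil x/2\rceil, s_1, s_2)$
and $K'_t(x, s_1, s_2) \le K'_t(\lceil x/2\rceil, s_1, s_2)$ yields the inequality given in the lemma.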

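We are going to detail the proof of the inequalities (A.2), (A.4) and (A.6) (in this order). The other
inequalities can be proved using the same computations. Let us start with (A.2). Hereafter we denote
by $A^{(i)}_{0k}$ the event $\{I_{0k}$ has more than $x$ clonal descendants within $\{0, \ldots, i-1\}$ and $\ell_{0k} \in [s_1, s_2)\}$.
\[
\begin{aligned}
\mathbb{P}_t(\alpha_{ik} \cap A_{0k} \cap A_{ij} \cap B_{ijk}) &= \mathbb{P}_t(\alpha_{ik} \cap A^{(i)}_{0k} \cap A_{ij} \cap B_{ijk}) \\
&\le \mathbb{P}_t(k \le M_0,\ A^{(i)}_{0k},\ i < N_t,\ j \le M_i,\ A_{ij}) \\
&= \mathbb{P}_t(k \le M_0,\ A^{(i)}_{0k},\ i < N_t,\ j \le M_i,\ A_{ij} \mid i < N_t)\,\mathbb{P}_t(i < N_t) \\
&= \mathbb{P}_t(k \le M_0,\ A^{(i)}_{0k} \mid i < N_t)\,\mathbb{P}_t(A_{ij},\ j \le M_i \mid i < N_t)\,\mathbb{P}_t(i < N_t) \\
&= \mathbb{P}_t(k \le M_0,\ A^{(i)}_{0k},\ i < N_t)\,\mathbb{P}_t(A_{0j},\ j \le M'_0).
\end{aligned}
\tag{A.8}
\]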

Then, denoting by $\rho_k$ the index of the $\lceil x\rceil$-th individual carrying the type of interval $I_{0k}$ ($\rho_k := +\infty$ if
there is no such individual),

i,j,k \0<fc<A/ i>l J j<M^ 

= Et( 2 l£ 0fce[ s llS2 )(^-Pfc) + E t ^(x )Sl)S2 ). 

\0<fc<M / 
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Now, the lack of memory property of geometric distributions yields 



>0<fc<M 



J2^t(N t -pk I k < M ,e k e [si,s 2 ),Pfe < oo)P t (/c < M ,4fc e [si,s 2 ),/9fc < oo) 

fc>0 

(E t iV t )£> t (fc<M ,4)*) 



fc>0 

= (E t iV t )(E t K t (x, Sl , S2 )), 
which entails (A.2).

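Next, let us proceed with (A.4) and let $\sigma_k$ denote the label of the first branch with length
greater than $\ell_{0(k+1)}$. Observe that conditionally on $\ell_{0(k+1)}$, $A_{0k}$ is independent of the branch lengths
occurring before $\sigma_k$, and further, the events $\{i < \sigma_k\} = \{\max_{j \le i-1} H_j < \ell_{0(k+1)}\}$, $\{H_i < \ell_{0(k+1)},\ j \le M_i,\ A_{ij}\}$
and $\{k \le M_0,\ A_{0k}\}$ are independent. As a consequence,
\[
\begin{aligned}
\mathbb{P}_t(\gamma_{ik} \cap A_{0k} \cap A_{ij} \cap B_{ijk})
&= \mathbb{P}_t(i < \sigma_k,\ k \le M_0,\ A_{0k},\ j \le M_i,\ A_{ij}) \\
&= \mathbb{P}_t(i < \sigma_k,\ k \le M_0,\ A_{0k},\ H_i < \ell_{0(k+1)},\ j \le M_i,\ A_{ij}) \\
&= \mathbb{E}_t\big(\mathbb{P}_t(i < \sigma_k \mid \ell_{0(k+1)})\,\mathbb{P}_t(A_{0k},\ k \le M_0 \mid \ell_{0(k+1)})\,\mathbb{P}_t(H_i < \ell_{0(k+1)},\ j \le M_i,\ A_{ij} \mid \ell_{0(k+1)})\big) \\
&\le \mathbb{E}_t\big(\mathbb{P}_t(i < \sigma_k,\ A_{0k},\ k \le M_0 \mid \ell_{0(k+1)})\,\mathbb{P}_t(j \le M_i,\ A_{ij} \mid \ell_{0(k+1)})\big) \\
&= \mathbb{E}_t\big(\mathbb{P}_t(i < \sigma_k,\ A_{0k},\ k \le M_0 \mid \ell_{0(k+1)})\,\mathbb{P}_t(j \le M_i,\ A_{ij})\big) \\
&= \mathbb{P}_t(j \le M'_0,\ A_{0j})\,\mathbb{P}_t(i < \sigma_k,\ A_{0k},\ k \le M_0).
\end{aligned}
\]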



As a consequence, since $\sigma_k$ and $\{k \le M_0,\ A_{0k}\}$ are independent conditionally on $\ell_{0(k+1)}$,



E p *(7ifc n 4* n ^ n B ijk ) < (E t K' t (x, s 1 ,s 2 )) E Ei E 1 fc<A/ ,A 

i,j',fc fc>0 

= (E t K' t {x,s 1 ,s 2 ))Y J E t (P t (fc < M ,^ofc I %+i))E t (cr fc | %+!))) 



fc>0 
fc>0 

(E^(x, si, s 2 ))(E t K t (x, Sl ,s 2 ))(E t N t ), 



(E t K , t (x,s 1 ,s 2 ))Y,^t (Pt(fc < M ,^ofe I V+i))^^)) 



which is (A.4).

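Finally, let us turn to (A.6). Denote by $A'^{(i)}_{0k}$ (resp. $A''^{(i)}_{0k}$) the event that there exist at least
$\lceil x/2\rceil$ individuals with label smaller (resp. larger) than $i$ descending clonally from the time interval
$I_{0k}$ and that $\ell_{0k} \in [s_1, s_2)$. Then
\[
\mathbb{P}_t(\beta_{ik} \cap A_{0k} \cap A_{ij} \cap B_{ijk}) \le \mathbb{P}_t(\beta_{ik} \cap A'^{(i)}_{0k} \cap A_{ij} \cap B_{ijk}) + \mathbb{P}_t(\beta_{ik} \cap A''^{(i)}_{0k} \cap A_{ij} \cap B_{ijk}).
\]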
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Let us deal with the first term of the right-hand side of this last inequality. Exactly as in the proof 

of <m , 



MPik,k< M ,A'$,j < Mi,Aij,i < N t ) < P t (k < M ,A'$,j < M, u A iv i < N t ) 

= F t (k < M ,A'$,i < N t )F t (A iv j < Mi 
= P t (fc < M ,A'$,i < N t )¥ t (A 0j ,j < Mq 



and we finally get 



J2F t (p ik nA'$ nAijnB ijk ) < (E t N t )(E t K t ([x/2],s 1 ,s 2 ))(E t K' t (x,s 1 ,s 2 )). (A.9) 

As for the second term, we need to define $J_i$ the unique integer satisfying $\ell_{0(J_i+1)} \le \max_{1 \le j \le i} H_j < \ell_{0J_i}$
($J_i := +\infty$ on $\{i \ge N_t\}$ and $J_i = k$ on $\beta_{ik}$). Then
\[
\sum_{i,j,k} \mathbb{P}_t(\beta_{ik} \cap A''^{(i)}_{0k} \cap A_{ij} \cap B_{ijk}) = \sum_{i,j} \mathbb{P}_t(A''^{(i)}_{0J_i},\ A_{ij},\ j \le M_i,\ i < N_t). \tag{A.10}
\]
Set also $\ell^*$ := the age of the oldest mutation on branch $H_i$ ($\ell^* = 0$ if $M_i = 0$). Then conditional
on $\{i < N_t\}$ and on the value of $\ell^*$, the numbers of clonal descendants $R^{ij}_t$ of the interval $I_{ij}$ and the
number, say $K^{(i)}$, of haplotypes whose last mutation is older than $\ell^*$ and $s_1$, younger than $s_2$, and
occurred on lineage 0, and with more than $\lceil x/2\rceil$ alive clonal descendants with labels larger than $i$,
are independent, so that



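\[
\begin{aligned}
\mathbb{P}_t(A''^{(i)}_{0J_i},\ A_{ij},\ j \le M_i,\ i < N_t) &\le \mathbb{P}_t(K^{(i)} \ge 1,\ A_{ij},\ j \le M_i,\ i < N_t) \\
&= \mathbb{E}_t\big(1_{i < N_t}\,\mathbb{P}_t(K^{(i)} \ge 1 \mid i < N_t, \ell^*)\,\mathbb{P}_t(A_{ij},\ j \le M_i \mid i < N_t, \ell^*)\big) \\
&\le \mathbb{E}_t\big(1_{i < N_t}\,\mathbb{P}_t(K_t(\lceil x/2\rceil, s_1, s_2) \ge 1)\,\mathbb{P}_t(A_{ij},\ j \le M_i \mid i < N_t, \ell^*)\big) \\
&= \mathbb{P}_t(K_t(\lceil x/2\rceil, s_1, s_2) \ge 1)\,\mathbb{P}_t(A_{ij},\ j \le M_i \mid i < N_t)\,\mathbb{P}_t(i < N_t) \qquad \text{(A.11)} \\
&\le (\mathbb{E}_t K_t(\lceil x/2\rceil, s_1, s_2))\,\mathbb{P}_t(A_{0j},\ j \le M'_0)\,\mathbb{P}_t(i < N_t). \qquad \text{(A.12)}
\end{aligned}
\]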



We finally obtain
\[
\sum_{i,j,k} \mathbb{P}_t(\beta_{ik} \cap A''^{(i)}_{0k} \cap A_{ij} \cap B_{ijk}) \le (\mathbb{E}_t N_t)(\mathbb{E}_t K_t(\lceil x/2\rceil, s_1, s_2))(\mathbb{E}_t K'_t(x, s_1, s_2)),
\]
which completes the proof of (A.6) by summing the last inequality and inequality (A.9).

The proof of (A.7) is very similar, but needs further explanation. Let us define the events $A'^{(i)}_{0k}$
and $A''^{(i)}_{0k}$ similarly as above, with the additional condition that $\ell_{0k} < H'$. Then, we first prove that
\[
\sum_{i,j,k} \mathbb{P}_t(\beta_{ik} \cap A'^{(i)}_{0k} \cap A_{ij} \cap B'_{ijk}) \le (\mathbb{E}_t N_t)(\mathbb{E}_t K'_t(\lceil x/2\rceil, s_1, s_2))(\mathbb{E}_t K'_t(x, s_1, s_2))
\]
following the very same computation as for (A.9). Next, we observe that $\mathbb{P}_t(A''^{(i)}_{00}) = 0$ since $H' < t$
a.s. and $\ell_{00} = t$. Therefore, (A.10) also holds true with our new definition of $A''^{(i)}_{0k}$. Thus, defining
$K'^{(i)}$ as the number of haplotypes whose last mutation is older than $\ell^*$ and $s_1$, younger than $s_2$ and $H'$,
and occurred on lineage 0, and with more than $\lceil x/2\rceil$ alive clonal descendants with labels larger than
$i$, the computation of (A.12) is true, provided that $K_t(\lceil x/2\rceil, s_1, s_2)$ is replaced by $K'_t(\lceil x/2\rceil, s_1, s_2)$.
We then obtain
\[
\sum_{i,j,k} \mathbb{P}_t(\beta_{ik} \cap A''^{(i)}_{0k} \cap A_{ij} \cap B'_{ijk}) \le (\mathbb{E}_t N_t)(\mathbb{E}_t K'_t(\lceil x/2\rceil, s_1, s_2))(\mathbb{E}_t K'_t(x, s_1, s_2)),
\]
and the proof of (A.7) is completed. □

Lemma 4.1 follows from the combination of the previous lemma with the following estimates on
$K_t(x, s_1, s_2)$ and $K'_t(x, s_1, s_2)$.

Lemma A.2 For all $t > 0$, $x \ge 1$, $0 \le s_1 < s_2 \le t$, we have

1 1 



Et^(x, si, s 2 ) < W ^ l] _ J^ t} E t K t (x, si, s 2 ), (A.13) 



VK(si) W{t) 

1 

(A.14) 

1 _ 1 

EtK;(x, S i, S2 )(^(a;, Sl , S2 ) - 1) < ^ f 7 ® E t K t (x,s 1 ,S2)(K t (x,s 1 ,s 2 ) - 1), (A.15) 

1 w(t) 

E t K t (x, Sl ,s 2 )(K t (x, Sl ,s 2 ) - 1) < — (E^(x, si, s 2 )) (1 + #(s 2 - si))x 

a 

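Proof. With the notation of the proof of Lemma A.1, we have
\[
\begin{aligned}
\mathbb{E}_t K'_t(x,s_1,s_2) &= \sum_{k \ge 1} \mathbb{P}_t(A_{0k},\ k \le M'_0) \\
&= \sum_{k \ge 1} \mathbb{P}_t(A_{0k},\ k \le M_0,\ H' > \ell_{0k}) \\
&\le \sum_{k \ge 1} \mathbb{P}_t(A_{0k},\ k \le M_0)\,\mathbb{P}(H' > s_1) \\
&\le \mathbb{E}_t K_t(x,s_1,s_2)\,\mathbb{P}(H > s_1 \mid H < t),
\end{aligned}
\]
which is inequality (A.13). Similarly,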

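\[
\begin{aligned}
\mathbb{E}_t K'_t(x,s_1,s_2)(K'_t(x,s_1,s_2) - 1) &= 2 \sum_{k \ge 1}\sum_{j > k} \mathbb{P}_t(A_{0k},\ A_{0j},\ j \le M'_0) \\
&\le 2 \sum_{k \ge 1}\sum_{j > k} \mathbb{P}_t(A_{0k},\ A_{0j},\ j \le M_0)\,\mathbb{P}(H' > s_1) \\
&\le \mathbb{E}_t[K_t(x,s_1,s_2)(K_t(x,s_1,s_2) - 1)]\,\mathbb{P}(H > s_1 \mid H < t),
\end{aligned}
\]
which is inequality (A.15).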

For the two other inequalities, let us define $R^{(a,b)}_t$ the number of individuals alive at time $t$
descending clonally from the time interval $(a, b)$. More specifically, given a progenitor individual
alive on the time interval $(a, b)$ and experiencing no mutation between times $a$ and $b$, $R^{(a,b)}_t$ is the
number of individuals alive at time $t$ (including this progenitor if $b > t$) descending from those
daughters of the progenitor who were born during the time interval $(a, b)$, and that still carry the
same type that the progenitor carried at time $a$. Since $W_\theta$ is the scale function associated with the
clonal reproduction process, for all $k \ge 1$,
\[
\begin{aligned}
\mathbb{P}(R^{(a,b)}_t = k) &= \mathbb{P}(N^\theta_{t-a} = k \mid \zeta = b - a) \\
&= \mathbb{P}(N^\theta_{t-a} \ne 0 \mid \zeta = b - a)\,\mathbb{P}(N^\theta_{t-a} = k \mid N^\theta_{t-a} \ne 0) \\
&= \left(1 - 1_{t > b}\,\frac{W_\theta(t-b)}{W_\theta(t-a)}\right) \left(1 - \frac{1}{W_\theta(t-a)}\right)^{k-1} \frac{1}{W_\theta(t-a)},
\end{aligned}
\tag{A.17}
\]
where $N^\theta$ is the population size process of a clonal splitting tree and $\zeta$ is the lifetime of the progenitor.
(This result is actually Eq. (5.3) of [2].)
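
The bounds that follow rely on the tail of this law. As a rough illustration, assuming the law of the number of clonal descendants is the geometric form $(1 - W_\theta(t-b)/W_\theta(t-a))\,(1 - 1/W_\theta(t-a))^{k-1}/W_\theta(t-a)$ for $k \ge 1$, the tail probability sums in closed form. In the sketch below, `w_ta` and `w_tb` are hypothetical numeric stand-ins for $W_\theta(t-a)$ and $W_\theta(t-b)$, and the function names are illustrative.

```python
def clonal_pmf(k, w_ta, w_tb):
    """P(R = k) for k >= 1 under the assumed geometric form.

    w_ta stands for W_theta(t - a) and w_tb for W_theta(t - b);
    both are plain numbers here, with 1 <= w_tb <= w_ta.
    """
    p_nonzero = 1.0 - w_tb / w_ta
    return p_nonzero * (1.0 - 1.0 / w_ta) ** (k - 1) / w_ta

def clonal_tail(x, w_ta, w_tb):
    """P(R >= x) for an integer x >= 1, obtained by summing the geometric series."""
    return (1.0 - w_tb / w_ta) * (1.0 - 1.0 / w_ta) ** (x - 1)

if __name__ == "__main__":
    w_ta, w_tb = 5.0, 2.0  # hypothetical values, for illustration only
    direct = sum(clonal_pmf(k, w_ta, w_tb) for k in range(3, 2000))
    print(direct, clonal_tail(3, w_ta, w_tb))
```

The factor $(1 - 1/W_\theta(\cdot))^{x-1}$ in the tail is the source of the powers of this type in the estimates of Lemma A.2.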

Note that, by construction of the splitting tree, replacing in the definition of $R^{(a,b)}_t$ the progenitor
individual alive on the time interval $(a, b)$ by a clonal lineage alive on the time interval $(a, b)$ does
not change the distribution of $R^{(a,b)}_t$. By lineage alive on the time interval $(a, b)$, we
mean here a finite sequence of individuals $(i_k)_{1 \le k \le K}$ such that individual $i_1$ was alive at time $a$,
individual $i_K$ was alive at time $b$, and for all $1 \le k \le K - 1$, individual $i_{k+1}$ was born from individual
$i_k$ at some time $a_k$ such that $a_1 > a$ and $a_{K-1} < b$. By clonal lineage alive on the time interval $(a, b)$
we mean in addition that for all $1 \le k \le K$, individual $i_k$ experienced no mutation during the time
interval $(a_{k-1}, a_k)$, where $a_0 = a$ and $a_K = b$.

Now, by definition of Kt(x, s±, S2), we have 

E t K t (x, Sl ,s 2 ) = Y, E t [%e[«i,. a ] P * (^f ^ x I Vi > 



k>0 



where 



/ yR$ k >x\£ 0j ,j>0 



F(R° t k >x, N t >l\e Qk ,£ 0(k+1) ) 
P(JVt > 1) 

< (i? t 0A; > », iv t > 1 1 4(fe+i)) , 

since $\mathbb{P}(N_t \ge 1) \ge \mathbb{P}(N_s \ge 1,\ \forall s \ge 0)$ and the survival probability of the splitting tree is $\alpha/b$. Now,
the event $\{N_t \ge 1,\ R^{0k}_t \ge x\}$ is the event where $N_t \ge 1$ and the clonal lineage on branch 0 on the
time interval $(t - \ell_{0k}, t - \ell_{0(k+1)})$ has more than $x$ clonal descendants alive at time $t$. Therefore,

RT > x I V j > 0) < ^ P (ijj*^'*-^^)) > x 1 . (A.18) 

Now, for all $k \ge 0$, $t - \ell_{0k}$ is distributed as the minimum of $t$ and a sum of $k$ i.i.d. exponential
random variables of parameter $\theta$, and $t - \ell_{0(k+1)}$ as the minimum of $t$ and the sum of $t - \ell_{0k}$ and an
exponential random variable of parameter $\theta$, independent of $t - \ell_{0k}$. Therefore, it follows from (A.17)
that

^E t K t (x, Sl ,s 2 ) < J dz6e~ dz l S2=t P(i?J M > x) 

+ J2 \~ ****** [ Z dy lye[t-s 2 ,t- Sl ] m?'^ > x) 



fc>l ' 

oo 







, S2=t , dzee- ez ( i - t z<t " 1 1 1 



w e [t-z)\ r i 



W e (t) ) \ W e (t), 



+ / dy6e oy I dz9e- dz (1-1 



t- S2 """" Jy V z<t ~W e (t - y) ) y W e (t-y) 

1 \ l-t i„n„-6{z-u )We{t-Z 



Wg(t - y) 



/ i \ \x\-x / ft 

^ <»* + *(*» (1-^5^) (l -jf*^: 

Equation (A.14) then follows from the changes of variables $z' = z - y$ and $y' = t - y$, and the identity
$1 = e^{-\theta y} + \int_0^y \theta e^{-\theta z}\, dz$.

Finally, let us turn to (A.16): first,
\[
\mathbb{E}_t K_t(x,s_1,s_2)(K_t(x,s_1,s_2) - 1) = 2 \sum_{0 \le j < k} \mathbb{P}_t(A_{0j},\ A_{0k},\ k \le M_0). \tag{A.19}
\]

Now, fix k > I > 0. Since ^oj > 4>(j+i) — ^Ofc > ^o(fc+l)> on the event {Aoj, Aq^, k < Ma}, we have 
- 4fc < s 2 - si, 4fc < s 2 and 4(fc+i) > «l - (4* - 4(fc+l))- Therefore, using (jA.18jl as before, 



P t (4y, A ofc , fc < M ) = E t [h 0j£[suS2 f t > x | toj, £ 0{j+1) ) t iok e [si ,s 2 fi (R° t h > x | £ ok , £ (fc+i) 

^e[si,s 2 ] > x I t-Qji 40'+l)) 1 ^0( 3 +i)-^0fc<S2-si x 



a 

where the last indicator comes from the fact that 1 — l/W^ofc) = when ^ofc = 0. Now, on the 
event {4ft > 0}, one has 

£ 0n = t- Ei- E n , V0 < n < k 

and 

4(fc+i) = V (t - E 1 ! - . . . - -Efc+i), 
where (E n ) n >i is a sequence of i.i.d. exponential r.v. of parameter 0. In addition, on the event 
{hk > si, lok ~ 4(fc+l) < si}> one has £ ok - 4(fc+i) = #Jfc+i- Hence, 



\(4oj,4>fc,fc<-Mb) < -E t 



X ls i+2+ ...+£; fc < S2 _ Sl 1 - tE k+1 < Sl 



W (s 2 ) J V W fl (s 2 ) 
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where 

L 0j = V (t - Ei - . . . - Ej) and LqQ+1) = V (t - E x - . . . - E j+ x). 
Since (Loj, LoG+i))) (Ej+2i ■ ■ ■ j-^fc) an d Ek+i are independent, we finally obtain 

F t (A 0j ,A ok ,k< M ) < - ^ Pt(Ay, j < M )F(E j+2 + . . . + E k < s 2 - si)x 

0<j<k 0<j<k 



w e (s 2 ) yy V w (s 2 

- (E t K t (x, s u s 2 )) V P(£i + . . . + < s 2 - si) x 



j>0 



x I 1- f fe-'-^;-')^ U - 1) W " . (A.20) 



o 



W e (s 2 ) J V W e (s 2 



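Now, we have
\[
\sum_{i \ge 0} \mathbb{P}(E_1 + \ldots + E_i \le s_2 - s_1) = 1 + \mathbb{E}(P) = 1 + \theta(s_2 - s_1),
\]
where $P$ is a Poisson r.v. of parameter $\theta(s_2 - s_1)$ (the term $i = 0$ is an empty sum and contributes 1). Combining this equation with (A.19) and (A.20)
ends the proof of (A.16). □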
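
The identity behind the factor $1 + \theta(s_2 - s_1)$ is the renewal count of a Poisson process: $\mathbb{P}(E_1 + \ldots + E_i \le s) = \mathbb{P}(P \ge i)$ for a Poisson variable $P$ of parameter $\theta s$, and summing these tail probabilities over $i \ge 1$ gives $\mathbb{E}(P) = \theta s$. The sketch below checks this numerically; the function names, the truncation level and the sample values of `rate` and `s` are illustrative assumptions.

```python
import math

def erlang_cdf(i, rate, s):
    """P(E_1 + ... + E_i <= s) for i.i.d. Exp(rate) variables.

    Uses the identity with a Poisson(rate * s) variable P:
    the sum of i exponentials is at most s if and only if P >= i.
    """
    if i == 0:
        return 1.0  # empty sum
    mu = rate * s
    term = math.exp(-mu)  # P(P = 0)
    below = 0.0           # accumulates P(P <= i - 1)
    for n in range(i):
        below += term
        term *= mu / (n + 1)
    return 1.0 - below

def renewal_sum(rate, s, terms=200):
    """Truncated sum over i >= 0 of P(E_1 + ... + E_i <= s)."""
    return sum(erlang_cdf(i, rate, s) for i in range(terms))

if __name__ == "__main__":
    rate, s = 1.5, 2.0  # hypothetical values, for illustration only
    print(renewal_sum(rate, s), 1.0 + rate * s)
```

The truncation at 200 terms is harmless for moderate values of `rate * s`, since the Poisson tail decays faster than geometrically.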
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